Spark 2.0 + Kryo Serializer + Avro -> NullPointerException? - apache-spark

I have the simple pyspark program:
from pyspark import SQLContext
from pyspark import SparkConf
from pyspark import SparkContext
if __name__ == "__main__":
spark_settings = {
"spark.serializer": 'org.apache.spark.serializer.KryoSerializer'
}
conf = SparkConf()
conf.setAll(spark_settings.items())
spark_context = SparkContext(appName="test app", conf=conf)
spark_sql_context = SQLContext(spark_context)
source_path = "s3n://my_bucket/data.avro"
data_frame = spark_sql_context.read.load(source_path, format="com.databricks.spark.avro")
# The schema comes back correctly.
data_frame.printSchema()
# This count() call fails. A call to head() triggers the same error.
data_frame.count()
I run with
$SPARK_HOME/bin/spark-submit --master yarn \
--packages com.databricks:spark-avro_2.11:3.0.0 \
bug_isolation.py
It fails with the following exception and stack trace.
If I switch to --master local it works. If I disable the KryoSerializer option, it works. Or if I use a Parquet source rather than an Avro source it works.
The combination of using --master yarn and the KryoSerializer and an Avro source triggers the exception and stack trace listed below.
I suspect I may need to manually register some Avro plugin classes with the KryoSerializer for it to work? Which classes would I need to register.
File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o58.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 9, ip-172-31-97-24.us-west-2.compute.internal): java.lang.NullPointerException
at com.databricks.spark.avro.DefaultSource$$anonfun$buildReader$1.apply(DefaultSource.scala:151)
at com.databricks.spark.avro.DefaultSource$$anonfun$buildReader$1.apply(DefaultSource.scala:143)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:279)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:263)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Related

Getting error while writing parquet files to Azure data lake storage gen 2

Hi I have a usecase where I am reading parquet files and writing it to ADLG Gen 2. This is without any modification to data.
MY Code:
val kustoLogsSourcePath: String = "/mnt/SOME_FOLDER/2023/01/11/fe73f221-b771-49c9-ba7d-2e2af4fe4f2a_1_69fc119b888447efa9ed2ecd7a4ab647.parquet"
val outputPath: String = "/mnt/SOME_FOLDER/2023/01/10/EventLogs1/"
val kustoLogData = spark.read.parquet(kustoLogsSourcePath)
kustoLogData.write.mode(SaveMode.Overwrite).save(outputPath)
I am getting this error, any ideas how to solve it:
Here, I have shared all the exception related messages that I got.
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:192)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:110)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:128)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:143)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:183)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:131)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:114)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:690)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:690)
at
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 276 in stage 2.0 failed 4 times, most recent failure: Lost task 276.3 in stage 2.0 (TID 351, 10.139.64.13, executor 5): com.databricks.sql.io.FileReadException: Error while reading file dbfs:[REDACTED]/eventlogs/2023/01/10/[REDACTED-FILE-NAME].parquet.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:272)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:256)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:197)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.scan_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
Caused by: java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:584)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(VectorizedColumnReader.java:634)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$100(VectorizedColumnReader.java:49)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:557)
at
Caused by: com.databricks.sql.io.FileReadException: Error while reading file dbfs:[REDACTED]/eventlogs/2023/01/11/fe73f221-b771-49c9-ba7d-2e2af4fe4f2a_1_69fc119b888447efa9ed2ecd7a4ab647.parquet.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:272)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:256)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:197)
at
Caused by: java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:584)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(VectorizedColumnReader.java:634)
at
It seems that some columns are DELTA_BYTE_ARRAY encoded, a workarround would be to turn off the vectorized reader property:
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
Try to modify your code and also remove the string parameter in the font of the variable and also use .format("delta") for reading delta file.
%scala
val kustoLogsSourcePath = "/mnt/SOME_FOLDER/2023/01/11/"
val outputPath = "/mnt/SOME_FOLDER/2023/01/10/EventLogs1/"
val kustoLogData = spark.read.format("delta").load(kustoLogsSourcePath)
kustoLogData.write.format("parquet").mode("append").mode(SaveMode.Overwrite).save(outputPath)
For the demo, this is my FileStore location /FileStore/tables/delta_train/.
I reproduce same in my environment as per above code .I got this output.

shc-core: NoSuchMethodError org.apache.hadoop.hbase.client.Put.addColumn

I try to use shc-core to save spark dataframe into hbase via spark.
My versions:
hbase: 1.1.2.2.6.4.0-91
spark: 1.6
scala: 2.10
shc: 1.1.1-1.6-s_2.10
hdp: 2.6.4.0-91
Configuration looks like that:
val schema_array = s"""{"type": "array", "items": ["string","null"]}""".stripMargin
def catalog: String = s"""{
|"table":{"namespace":"default", "name":"tblename"},
|"rowkey":"id",
|"columns":{
|"id":{"cf":"rowkey", "col":"id", "type":"string"},
|"col1":{"cf":"data", "col":"col1", "avro":"schema_array"}
|}
|}""".stripMargin
df
.write
.options(Map(
"schema_array"-> schema_array,
HBaseTableCatalog.tableCatalog -> catalog,
HBaseTableCatalog.newTable -> "5"
))
.format("org.apache.spark.sql.execution.datasources.hbase")
.save()
Sometimes it works fine as expected and creates table and saves all the data into hbase. But sometimes just fail with following error:
Lost task 35.0 in stage 9.0 (TID 301, host): java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.addColumn([B[B[B)Lorg/apache/hadoop/hbase/client/Put;
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation$$anonfun$org$apache$spark$sql$execution$datasources$hbase$HBaseRelation$$convertToPut$1$1.apply(HBaseRelation.scala:211)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation$$anonfun$org$apache$spark$sql$execution$datasources$hbase$HBaseRelation$$convertToPut$1$1.apply(HBaseRelation.scala:210)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.org$apache$spark$sql$execution$datasources$hbase$HBaseRelation$$convertToPut$1(HBaseRelation.scala:210)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation$$anonfun$insert$1.apply(HBaseRelation.scala:219)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation$$anonfun$insert$1.apply(HBaseRelation.scala:219)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1112)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1277)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Any ideas?
That was actually a class path issue - I've got two different versions of hbase client.

Query on Spark cluster to HBase raise "java.lang.IllegalStateException: unread block data" exception

Our Spark setup is on 3 servers and all can see the HBase cluster servers.
I'm using Hadoop 2.7.3, HBase 1.2.6 and Spark 2.1.3.
I connect to Spark with
/opt/spark/bin/spark-shell --master spark://master:7077
and run the following
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.{HBaseAdmin, Result, Put, HTable}
import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor, HColumnDescriptor }
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.TableDescriptor
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.conf.Configuration
import scala.collection.JavaConverters._
val conf = HBaseConfiguration.create()
val tablename = "default:Table1"
conf.set(TableInputFormat.INPUT_TABLE,tablename)
val admin = new HBaseAdmin(conf)
admin.isTableAvailable(tablename)
val hBaseRDD = sc.
newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
hBaseRDD.count()
When run on the spark-shell, admin.isTableAvailable(tablename) returns true.
This leads to thinking that Spark can access HBase, but invoking hBaseRDD.count() raises the following error:
java.lang.IllegalStateException: unread block data at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2776)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1600)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2280)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2204)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2062)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:428)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:301)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
at java.lang.Thread.run(Thread.java:748)
2018-07-17 15:58:54,974 ERROR [task-result-getter-3] scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 10.11.1.12 , executor 2): java.lang.IllegalStateException: unread block data at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2776)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1600)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2280)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2204)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2062)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:428)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:301)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1455)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1443)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1670)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1625)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1614)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1928)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1941)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1954)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1968)
at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
... 52 elided
Caused by: java.lang.IllegalStateException: unread block data
at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2776)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1600)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2280)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2204)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2062)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:428)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:301)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
at java.lang.Thread.run(Thread.java:748)
This error occurs when submitting to the cluster while it works correctly when executing the script on the spark-shell.
From some research, it looks like there can be as version mismatch or alternatively due to how I set spark.driver.extraClassPath, which I set to:
/opt/spark/jars/*:/opt/hbase/lib/commons-collections-3.2.2.jar:/opt/hbase/lib/commons-httpclient-3.1.jar:/opt/hbase/lib/findbugs-annotations-1.3.9-1.jar:/opt/hbase/lib/hbase-annotations-1.2.6.jar:/opt/hbase/lib/hbase-annotations-1.2.6-tests.jar:/opt/hbase/lib/hbase-client-1.2.6.jar:/opt/hbase/lib/hbase-common-1.2.6.jar:/opt/hbase/lib/hbase-common-1.2.6-tests.jar:/opt/hbase/lib/hbase-examples-1.2.6.jar:/opt/hbase/lib/hbase-external-blockcache-1.2.6.jar:/opt/hbase/lib/hbase-hadoop2-compat-1.2.6.jar:/opt/hbase/lib/hbase-hadoop-compat-1.2.6.jar:/opt/hbase/lib/hbase-it-1.2.6.jar:/opt/hbase/lib/hbase-it-1.2.6-tests.jar:/opt/hbase/lib/hbase-prefix-tree-1.2.6.jar:/opt/hbase/lib/hbase-procedure-1.2.6.jar:/opt/hbase/lib/hbase-protocol-1.2.6.jar:/opt/hbase/lib/hbase-resource-bundle-1.2.6.jar:/opt/hbase/lib/hbase-rest-1.2.6.jar:/opt/hbase/lib/hbase-server-1.2.6.jar:/opt/hbase/lib/hbase-server-1.2.6-tests.jar:/opt/hbase/lib/hbase-shell-1.2.6.jar:/opt/hbase/lib/hbase-thrift-1.2.6.jar:/opt/hbase/lib/jetty-util-6.1.26.jar:/opt/hbase/lib/ruby/hbase:/opt/hbase/lib/ruby/hbase/hbase.rb:/opt/hbase/lib/ruby/hbase.rb:/opt/hbase/lib/protobuf-java-2.5.0.jar:/opt/hbase/lib/metrics-core-2.2.0.jar:/opt/hbase/lib/htrace-core-3.1.0-incubating.jar:/opt/hbase/lib/guava-12.0.1.jar:/opt/hbase/lib/asm-3.1.jar:/opt/hbase/lib/Cdrpackage.jar:/opt/hbase/lib/commons-daemon-1.0.13.jar:/opt/hbase/lib/commons-el-1.0.jar:/opt/hbase/lib/commons-math-2.2.jar:/opt/hbase/lib/disruptor-3.3.0.jar:/opt/hbase/lib/jamon-runtime-2.4.1.jar:/opt/hbase/lib/jasper-compiler-5.5.23.jar:/opt/hbase/lib/jasper-runtime-5.5.23.jar:/opt/hbase/lib/jaxb-impl-2.2.3-1.jar:/opt/hbase/lib/jcodings-1.0.8.jar:/opt/hbase/lib/jersey-core-1.9.jar:/opt/hbase/lib/jersey-guice-1.9.jar:/opt/hbase/lib/jersey-json-1.9.jar:/opt/hbase/lib/jettison-1.3.3.jar:/opt/hbase/lib/jetty-sslengine-6.1.26.jar:/opt/hbase/lib/joni-2.1.2.jar:/opt/hbase/lib/jruby-complete-1.6.8.jar:/opt/hbase/lib/jsch-0.1.42.jar:/opt/hbase/lib/jsp-2.1-6.1.14.jar:/opt/hbase/lib/junit-4.12.jar:/opt/hbase/lib/servlet-api-2.5-6.1.14.jar:/opt/hbase/lib/servlet-api-2.5.jar:/opt/hbase/lib/spymemcached-2.11.6.jar:/opt/hive-hbase//opt/hive-hbase/hive-hbase-handler-2.0.1.jar
Is there any solution? Or is it a Spark issue or a bug?
I found the solution.
first I tried to upgrade java, hadoop, hbase and spark version but I encountered the same exception.
finally, I found the reason.
in spark/conf/spark-defauls.conf I have added spark.driver.extraClassPath parameter already, while the workers needed spark.executor.extraClassPath parameter to find those jar files.
with that, the previous versions of them worked correctly.

Cassandra Connector fails when run under Spark 2.3 on Kubernetes

I'm trying to use the connector, which I've used a bunch of times in the past super successfully, with the new Spark 2.3 native Kubernetes support and am running into a lot of trouble.
I have a super simple job that looks like this:
package io.rhom
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.cassandra._
import com.datastax.spark.connector.cql.CassandraConnectorConf
import com.datastax.spark.connector.rdd.ReadConf
/** Computes an approximation to pi */
object BackupLocations {
def main(args: Array[String]) {
val spark = SparkSession
.builder
.appName("BackupLocations")
.getOrCreate()
spark.sparkContext.hadoopConfiguration.set(
"fs.defaultFS",
"wasb://<snip>"
)
spark.sparkContext.hadoopConfiguration.set(
"fs.azure.account.key.rhomlocations.blob.core.windows.net",
"<snip>"
)
val df = spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "locations", "keyspace" -> "test"))
.load()
df.write
.mode("overwrite")
.format("com.databricks.spark.avro")
.save("wasb://<snip>")
spark.stop()
}
}
which I'm building under SBT with Scala 2.11 and packaging with a Dockerfile that looks like this:
FROM timfpark/spark:20180305
COPY core-site.xml /opt/spark/conf
RUN mkdir -p /opt/spark/jars
COPY target/scala-2.11/rhom-backup-locations_2.11-0.1.0-SNAPSHOT.jar /opt/spark/jars
and then executing with:
bin/spark-submit --master k8s://blue-rhom-io.eastus2.cloudapp.azure.com:443 \
--deploy-mode cluster \
--name backupLocations \
--class io.rhom.BackupLocations \
--conf spark.executor.instances=2 \
--conf spark.cassandra.connection.host=10.1.0.10 \
--conf spark.kubernetes.container.image=timfpark/rhom-backup-locations:20180306v12 \
--jars https://dl.bintray.com/spark-packages/maven/datastax/spark-cassandra-connector/2.0.3-s_2.11/spark-cassandra-connector-2.0.3-s_2.11.jar,http://central.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.2/hadoop-azure-2.7.2.jar,http://central.maven.org/maven2/com/microsoft/azure/azure-storage/3.1.0/azure-storage-3.1.0.jar,http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar \
local:///opt/spark/jars/rhom-backup-locations_2.11-0.1.0-SNAPSHOT.jar
all of this works except for the Cassandra connection piece, which eventually fails with:
2018-03-07 01:19:38 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, 10.4.0.46, executor 1): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Exception during preparation of SELECT "user_id", "timestamp", "accuracy", "altitude", "altitude_accuracy", "course", "features", "latitude", "longitude", "source", "speed" FROM "rhom"."locations" WHERE token("user_id") > ? AND token("user_id") <= ? ALLOW FILTERING: org/apache/spark/sql/catalyst/package$ScalaReflectionLock$
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.createStatement(CassandraTableScanRDD.scala:323)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.com$datastax$spark$connector$rdd$CassandraTableScanRDD$$fetchTokenRange(CassandraTableScanRDD.scala:339)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:367)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:367)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at com.datastax.spark.connector.util.CountingIterator.hasNext(CountingIterator.scala:12)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:380)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
... 8 more
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/package$ScalaReflectionLock$
at org.apache.spark.sql.catalyst.ReflectionLock$.<init>(ReflectionLock.scala:5)
at org.apache.spark.sql.catalyst.ReflectionLock$.<clinit>(ReflectionLock.scala)
at com.datastax.spark.connector.types.TypeConverter$.<init>(TypeConverter.scala:73)
at com.datastax.spark.connector.types.TypeConverter$.<clinit>(TypeConverter.scala)
at com.datastax.spark.connector.types.BigIntType$.converterToCassandra(PrimitiveColumnType.scala:50)
at com.datastax.spark.connector.types.BigIntType$.converterToCassandra(PrimitiveColumnType.scala:46)
at com.datastax.spark.connector.types.ColumnType$.converterToCassandra(ColumnType.scala:231)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$11.apply(CassandraTableScanRDD.scala:312)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$11.apply(CassandraTableScanRDD.scala:312)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.createStatement(CassandraTableScanRDD.scala:312)
... 23 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.package$ScalaReflectionLock$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 41 more
2018-03-07 01:19:38 INFO TaskSetManager:54 - Starting task 0.1 in stage 0.0 (TID 3, 10.4.0.46, executor 1, partition 0, ANY, 9486 bytes)
I've tried every thing I can possibly think of to resolve this - anyone have any ideas? Is this possibly caused by another unrelated issue?
It turns out that version 2.0.7 of the Datastax Cassandra Connector does not support Spark 2.3 currently. I opened a JIRA ticket on Datastax's site for this and hopefully it will be addressed soon.

Issue with Zeppelin on Spark-Cassandra system: Classnotfoundexception

I have recently started to work with zeppelin on top of a Spark-Cassandra Cluster (Master + 3 Workers) System to run simple machine learning algorithms using the MLlib library.
Here are the libraries that I loaded to zeppelin:
%dep
z.load("com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M1")
z.load("org.apache.spark:spark-core_2.10:1.4.1")
z.load("com.datastax.cassandra:cassandra-driver-core:2.1.3")
z.load("org.apache.thrift:libthrift:0.9.2")
z.load("org.apache.spark:spark-mllib_2.10:1.4.0")
z.load("cassandra-clientutil-2.1.3.jar")
z.load("joda-time-2.3.jar")
I've tried to implement a script for linear regression. But, when I run it I get the following error message :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.xxx.xxx.xxx): java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:344)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:66)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
...
The problem is that the script runs without problems using spark-submit script, which confused me.
Here is some of the code that I was trying to execute:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.mllib.regression.{LinearRegressionWithSGD, LinearRegressionModel, LabeledPoint}
import org.apache.spark.rdd.RDD
sc.stop()
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "xxx.xxx.xxx.xxx").setMaster("spark://xxx.xxx.xxx.xxx:7077").setAppName("DEMONSTRATION")
val sc = new SparkContext(conf)
case class Fact(numdoc:String, numl:String, year:String, creator:Double, date:Double, day:Double, user:Double, workingday:Double, total:String)
val data= sc.textFile("~/Input/Data.csv ยป)
val parsed = data.filter(!_.isEmpty).map {row =>
val splitted = row.split(",")
val Array(nd, nl, yr)=splitted.slice(0,3)
val Array(cr, dt, wd, us, wod)=splitted.slice(3,8).map(_.toDouble)
Fact (nd, nl, yr, cr, dt, wd, us, wod, splitted(8))
}
val class2id = parsed.map(_.total.toDouble).distinct.collect.zipWithIndex.map{case (k,v) => (k, v.toDouble)}.toMap
val id2class = class2id.map(_.swap)
val parsedData = parsed.map { i => LabaledPoint(class2id(i.total.toDouble), Array(i.creator,i.date,i.day,i.workingday))
val model: LinearRegressionModel = LinearRegressionWithSGD.train(parsedData, 3)
Thank you in advance !
I finally found a solution !
In fact I shouldn't stop the SparkContext in the beginning and create a new one. But, In that case, I could not get access to Cassandra in a remote machine because by default zeppelin uses the address of the machine where it is installed as the address of Cassandra host. So I installed a new Cassandra instance there and I added it to my initial cluster, and the problem was solved.

Resources