Spark Streaming works in Local mode but "stages fail" with "could not initialize class" in Client/Cluster mode

I have a Spark + Kafka streaming app that runs fine in local mode. However, when I try to launch it on YARN in client or cluster mode I get several errors like the ones below.
The first error I always see is:
WARN TaskSetManager: Lost task 1.1 in stage 3.0 (TID 9, ip-xxx-24-129-36.ec2.internal, executor 2): java.lang.NoClassDefFoundError: Could not initialize class TestStreaming$
at TestStreaming$$anonfun$main$1$$anonfun$apply$1.apply(TestStreaming.scala:60)
at TestStreaming$$anonfun$main$1$$anonfun$apply$1.apply(TestStreaming.scala:59)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:917)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:917)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The next error I get is:
ERROR JobScheduler: Error running job streaming job 1541786030000 ms.0
followed by
java.lang.NoClassDefFoundError: Could not initialize class
Spark version: 2.1.0
Scala version: 2.11
Kafka version: 0.10
Part of my code loads the config in main when the app launches. I pass this config file at runtime with -conf AFTER the jar (see below). I'm not quite sure, but do I need to pass this config to the executors as well?
I launch my streaming app with the commands below: one shows local mode, the other shows client mode.
runJar=myProgram.jar
loggerPath=/path/to/log4j.properties
mainClass=TestStreaming
logger=-DPHDTKafkaConsumer.app.log4j=$loggerPath
confFile=application.conf
-----------Local Mode----------
SPARK_KAFKA_VERSION=0.10 nohup spark2-submit \
  --driver-java-options "$logger" \
  --conf "spark.executor.extraJavaOptions=$logger" \
  --class $mainClass \
  --master local[4] \
  $runJar -conf $confFile &
-----------Client Mode----------
SPARK_KAFKA_VERSION=0.10 nohup spark2-submit \
  --master yarn \
  --conf "spark.executor.extraJavaOptions=$logger" \
  --conf "spark.driver.extraJavaOptions=$logger" \
  --class $mainClass \
  $runJar -conf $confFile &
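I'm not sure whether I instead need something like shipping the file with --files $confFile so the executors can load it themselves; a rough sketch of what I mean (assuming a Typesafe-Config-style application.conf, purely for illustration):
import java.io.File
import com.typesafe.config.ConfigFactory // illustrative: assumes a Typesafe Config file

// With `--files application.conf` on spark-submit, each YARN container gets a
// local copy of the file in its working directory, so an executor could parse
// it by name the first time it is touched.
lazy val executorSideConfig = ConfigFactory.parseFile(new File("application.conf"))
// e.g. executorSideConfig.getString("config.TOPIC_NAME")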
Here is my code. I've been battling this for over a week now.
import Util.UtilFunctions
import UtilFunctions.config
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.log4j.Logger
object TestStreaming extends Serializable {

  @transient lazy val logger: Logger = Logger.getLogger(getClass.getName)

  def main(args: Array[String]) {
    logger.info("Starting app")

    UtilFunctions.loadConfig(args)
    UtilFunctions.loadLogger()

    val props: Map[String, String] = setKafkaProperties()
    val topic = Set(config.getString("config.TOPIC_NAME"))

    val conf = new SparkConf()
      .setAppName(config.getString("config.SPARK_APP_NAME"))
      .set("spark.streaming.backpressure.enabled", "true")

    val spark = SparkSession.builder()
      .config(conf)
      .getOrCreate()

    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
    ssc.sparkContext.setLogLevel("INFO")
    ssc.checkpoint(config.getString("config.SPARK_CHECKPOINT_NAME"))

    val kafkaStream = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topic, props))
    val distRecordsStream = kafkaStream.map(record => (record.key(), record.value()))
    distRecordsStream.window(Seconds(10), Seconds(10)) // note: the windowed stream is not assigned or used

    distRecordsStream.foreachRDD(rdd => {
      if (!rdd.isEmpty()) {
        rdd.foreach(record => {
          println(record._2) // value from kafka
        })
      }
    })

    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }

  def setKafkaProperties(): Map[String, String] = {
    val deserializer = "org.apache.kafka.common.serialization.StringDeserializer"
    val zookeeper = config.getString("config.ZOOKEEPER")
    val offsetReset = config.getString("config.OFFSET_RESET")
    val brokers = config.getString("config.BROKERS")
    val groupID = config.getString("config.GROUP_ID")
    val autoCommit = config.getString("config.AUTO_COMMIT")
    val maxPollRecords = config.getString("config.MAX_POLL_RECORDS")
    val maxPollIntervalms = config.getString("config.MAX_POLL_INTERVAL_MS")

    val props = Map(
      "bootstrap.servers" -> brokers,
      "zookeeper.connect" -> zookeeper,
      "group.id" -> groupID,
      "key.deserializer" -> deserializer,
      "value.deserializer" -> deserializer,
      "enable.auto.commit" -> autoCommit,
      "auto.offset.reset" -> offsetReset,
      "max.poll.records" -> maxPollRecords,
      "max.poll.interval.ms" -> maxPollIntervalms)
    props
  }
}

Related

Spark3.0 failed to write stream from cosmos changefeed to snowflake

I use:
<dependency>
<groupId>com.azure.cosmos.spark</groupId>
<artifactId>azure-cosmos-spark_3-1_2-12</artifactId>
<version>4.4.0</version>
</dependency>
to read a stream from the Cosmos change feed, but it fails with:
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.execution.streaming.MetadataVersionUtil$
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
I found that MetadataVersionUtil is a new class in Spark 3.1, and I can't find a jar built for Spark 3.0, such as azure-cosmos-spark_3-0_2-12.
So I cloned https://github.com/Azure/azure-cosmosdb-spark.git, checked out the 3.0 branch, and ran mvn install to build the jar. But when I use this new jar, it fails with:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: cosmos.oltp.changeFeed. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:689)
at org.apache.spark.sql.streaming.DataStreamReader.loadInternal(DataStreamReader.scala:209)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:195)
at com.fti.cosmos.StreamToSnowflake$.main(StreamToSnowflake.scala:79)
at com.fti.cosmos.StreamToSnowflake.main(StreamToSnowflake.scala)
Caused by: java.lang.ClassNotFoundException: cosmos.oltp.changeFeed.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:663)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:663)
at scala.util.Failure.orElse(Try.scala:224)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:663)
... 4 more
22/02/25 17:15:56 INFO SparkContext: Invoking stop() from shutdown hook
The Scala code is:
val readConfig = Map(
  "spark.cosmos.accountEndpoint" -> s"${cosmosHost}:${cosmosPort}",
  "spark.cosmos.accountKey" -> cosmosPassword,
  "spark.cosmos.database" -> cosmosSourceDB,
  "spark.cosmos.container" -> cosmosSourceCollection,
  "spark.cosmos.read.partitioning.strategy" -> "Default",
  "spark.cosmos.read.inferSchema.enabled" -> "false",
  "spark.cosmos.changeFeed.startFrom" -> cosmosCfStartFrom,
  "spark.cosmos.changeFeed.mode" -> "Incremental"
)

// init spark
val ss = SparkSession.builder
  .appName(s"${this.getClass.getName}-${cosmosSourceDB}-${cosmosSourceCollection}")
  .master("local")
  .getOrCreate()

val df = ss.readStream.format("cosmos.oltp.changeFeed").options(readConfig).load()

val sfOptions = Map(
  "sfURL" -> "",
  "sfUser" -> "",
  "sfRole" -> s"",
  "pem_private_key" -> "",
  "sfDatabase" -> s"",
  "sfSchema" -> sfSchema,
  "sfWarehouse" -> sfWarehouse
)

df.writeStream.trigger(Trigger.ProcessingTime("30 seconds"))
  .foreachBatch((ds: DataFrame, batchId: Long) => {
    ds.toDF().show(false)
    ds.toDF().write
      .format(SNOWFLAKE_SOURCE_NAME)
      .options(sfOptions)
      .option("dbtable", targetTable)
      .mode(SaveMode.Append)
      .save()
  })
  .option("checkpointLocation", f"wasbs://....")
  .outputMode("append")
  .start()
  .awaitTermination()
Has anyone successfully synced Cosmos's change feed with Spark 3.0? I don't want to have to upgrade HDInsight to 3.1.

Failed to load com.saprk.demo.Hive. java.lang.ClassNotFoundException: com.saprk.demo.Hive

package com.saprk.demo

import org.apache.spark.sql.SparkSession

object Hive {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("Spark SQL basic example")
      .config("hive.metastore.warehouse.dir", "hdfs://user/hive/warehouse")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("create database employee")
    spark.sql("show databases").show()
  }
}
I am trying to create a database in Hive through Spark, and while submitting this on Amazon EMR I get the exception:
Failed to load com.saprk.demo.Hive. java.lang.ClassNotFoundException: com.saprk.demo.Hive

Spark streaming and kafka Missing required configuration "partition.assignment.strategy" which has no default value

I am trying to run a Spark Streaming application with Kafka on YARN. I am getting the following stack trace error:
Caused by: org.apache.kafka.common.config.ConfigException: Missing required configuration "partition.assignment.strategy" which has no default value.
at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:124)
at org.apache.kafka.common.config.AbstractConfig.(AbstractConfig.java:48)
at org.apache.kafka.clients.consumer.ConsumerConfig.(ConsumerConfig.java:194)
at org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:380)
at org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:363)
at org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:350)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.(CachedKafkaConsumer.scala:45)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer$.get(CachedKafkaConsumer.scala:194)
at org.apache.spark.streaming.kafka010.KafkaRDDIterator.(KafkaRDD.scala:252)
at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:212)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
Here is a snippet of my code showing how I create my Kafka stream with Spark Streaming:
val ssc = new StreamingContext(sc, Seconds(60))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "*boorstrap_url:port*",
  "security.protocol" -> "SASL_PLAINTEXT",
  "sasl.kerberos.service.name" -> "kafka",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "annotation-test",
  // Tried commenting and uncommenting this property
  // "partition.assignment.strategy" -> "org.apache.kafka.clients.consumer.RangeAssignor",
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean))

val topics = Array("*topic-name*")

val kafkaStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams))

val valueKafka = kafkaStream.map(record => record.value())
I have gone through the following posts:
https://issues.apache.org/jira/browse/KAFKA-4547
Pyspark Structured Streaming Kafka configuration error
Based on these, I updated the Kafka client jar in my fat jar from the 0.10.1.0 version packaged by default (a transitive dependency of spark-streaming-kafka) to 0.10.2.0. Also, my job works fine when I run it on a single node with master set to local. I am running Spark 2.3.1.
Add kafka-clients-*.jar to your Spark jars folder. kafka-clients-*.jar can be found in the kafka-*/lib directory.
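Alternatively, if you build a fat jar with sbt, pinning kafka-clients explicitly alongside spark-streaming-kafka-0-10 keeps a single client version on the classpath; a rough sketch (versions are illustrative, echoing the Spark 2.3.1 / kafka-clients 0.10.2.0 combination from the question):
// build.sbt sketch (illustrative versions; match them to your cluster's Spark)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.1",
  "org.apache.kafka"  %  "kafka-clients"              % "0.10.2.0"
)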

Exception while integrating spark with Kafka

I am doing Spark-Kafka streaming for a word count, and I built the jar using sbt.
When I run spark-submit, the following exception is thrown.
Exception in thread "streaming-start" java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.isDirectory()Z
at org.apache.spark.streaming.util.FileBasedWriteAheadLog.initializeOrRecover(FileBasedWriteAheadLog.scala:245)
at org.apache.spark.streaming.util.FileBasedWriteAheadLog.<init>(FileBasedWriteAheadLog.scala:80)
at org.apache.spark.streaming.util.WriteAheadLogUtils$$anonfun$2.apply(WriteAheadLogUtils.scala:142)
at org.apache.spark.streaming.util.WriteAheadLogUtils$$anonfun$2.apply(WriteAheadLogUtils.scala:142)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.streaming.util.WriteAheadLogUtils$.createLog(WriteAheadLogUtils.scala:141)
at org.apache.spark.streaming.util.WriteAheadLogUtils$.createLogForDriver(WriteAheadLogUtils.scala:99)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker$$anonfun$createWriteAheadLog$1.apply(ReceivedBlockTracker.scala:256)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker$$anonfun$createWriteAheadLog$1.apply(ReceivedBlockTracker.scala:254)
at scala.Option.map(Option.scala:146)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.createWriteAheadLog(ReceivedBlockTracker.scala:254)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.<init>(ReceivedBlockTracker.scala:77)
at org.apache.spark.streaming.scheduler.ReceiverTracker.<init>(ReceiverTracker.scala:109)
at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:87)
at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:583)
at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:578)
at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:578)
at org.apache.spark.util.ThreadUtils$$anon$2.run(ThreadUtils.scala:126)
18/03/27 12:43:55 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#12010fd1{/streaming,null,AVAILABLE,#Spark}
18/03/27 12:43:55 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#552ed807{/streaming/json,null,AVAILABLE,#Spark}
18/03/27 12:43:55 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#7318daf8{/streaming/batch,null,AVAILABLE,#Spark}
18/03/27 12:43:55 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#3f1ddac2{/streaming/batch/json,null,AVAILABLE,#Spark}
18/03/27 12:43:55 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#37864b77{/static/streaming,null,AVAILABLE,#Spark}
18/03/27 12:43:55 INFO streaming.StreamingContext: StreamingContext started
My spark-submit command:
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 --class "KafkaWordCount" --master local[4] scala_project_2.11-1.0.jar localhost:2181 test-consumer-group word-count 1
scala_version: 2.11.8
spark_version: 2.2.0
sbt_version: 1.0.3
// Imports (not shown in the original snippet) for the spark-streaming-kafka-0-8 API used below
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]) {
    val (zkQuorum, group, topics, numThreads) = ("localhost:2181", "test-consumer-group", "word-count", 1)

    val sparkConf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("KafkaWordCount")

    val ssc = new StreamingContext(sparkConf, Seconds(2))
    ssc.checkpoint("checkpoint")

    val topicMap = topics.split(",").map((_, numThreads)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    val words = lines.flatMap(_.split(" "))

    words.foreachRDD(rdd => println("#####################rdd###################### " + rdd.first))

    val wordCounts = words.map(x => (x, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Spark Streaming Kafka stream

I'm having some issues while trying to read from kafka with spark streaming.
My code is:
val sparkConf = new SparkConf().setMaster("local[2]").setAppName("KafkaIngestor")
val ssc = new StreamingContext(sparkConf, Seconds(2))

val kafkaParams = Map[String, String](
  "zookeeper.connect" -> "localhost:2181",
  "group.id" -> "consumergroup",
  "metadata.broker.list" -> "localhost:9092",
  "zookeeper.connection.timeout.ms" -> "10000"
  //"kafka.auto.offset.reset" -> "smallest"
)

val topics = Set("test")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
I previously started zookeeper at port 2181 and Kafka server 0.9.0.0 at port 9092.
But I get the following error in the Spark driver:
Exception in thread "main" java.lang.ClassCastException: kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6$$anonfun$apply$7.apply(KafkaCluster.scala:90)
at scala.Option.map(Option.scala:145)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:90)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:87)
Zookeeper log:
[2015-12-08 00:32:08,226] INFO Got user-level KeeperException when processing sessionid:0x1517ec89dfd0000 type:create cxid:0x34 zxid:0x1d3 txntype:-1 reqpath:n/a Error Path:/brokers/ids Error:KeeperErrorCode = NodeExists for /brokers/ids (org.apache.zookeeper.server.PrepRequestProcessor)
Any hint?
Thank you very much
The problem was related to the wrong spark-streaming-kafka version.
As described in the documentation:
Kafka: Spark Streaming 1.5.2 is compatible with Kafka 0.8.2.1
So, including
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.10</artifactId>
<version>0.8.2.2</version>
</dependency>
in my pom.xml (instead of version 0.9.0.0) solved the issue.
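For completeness, the same pairing can be written in sbt; a rough sketch (illustrative only, the Maven snippet above is the actual fix):
// build.sbt sketch (illustrative): pair the Spark streaming-Kafka module and the
// Kafka client library on the versions the Spark 1.5.2 docs list as compatible.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kafka" % "1.5.2",
  "org.apache.kafka" %% "kafka" % "0.8.2.2"
)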
Hope this helps
Kafka10 streaming / Spark 2.1.0 / DCOS / Mesosphere
Ugh, I spent all day on this and must have read this post a dozen times. I tried Spark 2.0.0, 2.0.1, Kafka 0.8, and Kafka 0.10. Stay away from Kafka 0.8 and Spark 2.0.x; dependencies are everything. Start with the setup below. It works.
SBT:
"org.apache.hadoop" % "hadoop-aws" % "2.7.3" excludeAll ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-common"),
"org.apache.spark" %% "spark-core" % "2.1.0",
"org.apache.spark" %% "spark-sql" % "2.1.0" ,
"org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.1.0",
"org.apache.spark" % "spark-streaming_2.11" % "2.1.0"
Working Kafka/Spark Streaming code:
val spark = SparkSession
  .builder()
  .appName("ingest")
  .master("local[4]")
  .getOrCreate()

import spark.implicits._

val ssc = new StreamingContext(spark.sparkContext, Seconds(2))
val topics = Set("water2").toSet

val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "broker:port,broker:port",
  "bootstrap.servers" -> "broker:port,broker:port",
  "group.id" -> "somegroup",
  "auto.commit.interval.ms" -> "1000",
  "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> "true"
)

val messages = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

messages.foreachRDD(rdd => {
  if (rdd.count() >= 1) {
    rdd.map(record => (record.key, record.value))
      .toDS()
      .withColumnRenamed("_2", "value")
      .drop("_1")
      .show(5, false)
    println(rdd.getClass)
  }
})

ssc.start()
ssc.awaitTermination()
