I have a Scala Spark Streaming application that receives data from the same topic from 3 different Kafka producers.
The Spark streaming application is on machine with host 0.0.0.179, the Kafka server is on machine with host 0.0.0.178, the Kafka producers are on machines, 0.0.0.180, 0.0.0.181, 0.0.0.182.
When I try to run the Spark Streaming application got below error
Exception in thread "main" org.apache.spark.SparkException: Job
aborted due to stage failure: Task 0 in stage 19.0 failed 1 times,
most recent failure: Lost task 0.0 in stage 19.0 (TID 19, localhost):
java.util.ConcurrentModificationException: KafkaConsumer is not safe
for multi-threaded access at
org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1625)
at
org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1198)
at
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
at
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
at
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:228)
at
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:194)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at
scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1204)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203)
at
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1211)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Now I read thousand of different posts but no one seems to be able to find a solution at this issue.
How can I handle this on my application? Do I have to modify some parameters on Kakfa (at the moment the num.partition parameter is set to 1)?
Following is the code of my application :
// Create the context with a 5 second batch size
val sparkConf = new SparkConf().setAppName("SparkScript").set("spark.driver.allowMultipleContexts", "true").set("spark.streaming.concurrentJobs", "3").setMaster("local[4]")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(3))
case class Thema(name: String, metadata: String)
case class Tempo(unit: String, count: Int, metadata: String)
case class Spatio(unit: String, metadata: String)
case class Stt(spatial: Spatio, temporal: Tempo, thematic: Thema)
case class Location(latitude: Double, longitude: Double, name: String)
case class Datas1(location : Location, timestamp : String, windspeed : Double, direction: String, strenght : String)
case class Sensors1(sensor_name: String, start_date: String, end_date: String, data1: Datas1, stt: Stt)
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "0.0.0.178:9092",
"key.deserializer" -> classOf[StringDeserializer].getCanonicalName,
"value.deserializer" -> classOf[StringDeserializer].getCanonicalName,
"group.id" -> "test_luca",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics1 = Array("topics1")
val s1 = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics1, kafkaParams)).map(record => {
implicit val formats = DefaultFormats
parse(record.value).extract[Sensors1]
}
)
s1.print()
s1.saveAsTextFiles("results/", "")
ssc.start()
ssc.awaitTermination()
Thank you
Your problem is here:
s1.print()
s1.saveAsTextFiles("results/", "")
Since Spark creates a graph of flows, and you define two flows here:
Read from Kafka -> Print to console
Read from Kafka -> Save to text file
Spark will attempt to concurrently run both of these graphs, since they are independent of each other. Since Kafka uses a cached consumer approach, it is effectively trying to use the same consumer for both stream executions.
What you can do is cache the DStream before running the two queries:
val dataFromKafka = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics1, kafkaParams)).map(/* stuff */)
val cachedStream = dataFromKafka.cache()
cachedStream.print()
cachedStream.saveAsTextFiles("results/", "")
Using cache worked for me . In my case print , transformation and then print on JavaPairDstream was giving me that error .
I used cache just before first print, it worked for me.
s1.print()
s1.saveAsTextFiles("results/", "")
Below code will work, i used similar code .
s1.cache();
s1.print();
s1.saveAsTextFiles("results/", "");
Related
Let us say I have just launched a Kafka direct stream + spark streaming application. For the first batch, the Streaming Context in the driver program connects to the Kafka and fetches startOffset and endOffset. Then, it launches a spark job with these start and end offset ranges for the executors to fetch records from Kafka. My question starts here. When its time for the second batch, the Streaming context connects to the Kafka for the start and end offset ranges. How is Kafka able to give these ranges, when there is no consumer group (as Direct stream does not take into account group.id) that allows to store the last commit offset value?
There is always a Consumer Group when working with the Kafka Consumer API. It doesn't matter what kind of Stream you are dealing with (Spark Direct Streaming, Spark Structured Streaming, Java/Scala API of Kafka Consumer...).
as Direct stream does not take into account group.id
Have a look at the Spark + Kafka integration Guide for direct streaming (code example for spark-streaming-kafka010) on how to declare a consumer group:
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092,anotherhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "use_a_separate_group_id_for_each_stream",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("topicA", "topicB")
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
stream.map(record => (record.key, record.value))
Even if you are not declaring a consumer group in your configuration, there will be still a (random) consumer group created for you.
Check your logs to see which group.id has been used in your application.
Can i know if kafka consumer can read specific records when from and until offsets of partitions of a topic are known.
Use case is in my spark streaming application few batch are not processed(inserted to table) in this case i want to read only missed data. I am storing the topics details i.e partitions and offsets.
Can someone let me know if this can be achieved reading from topic when offsets are known.
If you want to process set of messages, that is defined by starting and ending offset in spark streaming you can use following code:
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "groupId"
)
val offsetRanges = Array(
OffsetRange("input", 0, 2, 4) // <-- topic name, partition number, fromOffset, untilOffset
)
val sparkContext: SparkContext = ???
val rdd = KafkaUtils.createRDD(sparkContext, kafkaParams.asJava, offsetRanges, PreferConsistent)
// other proccessing and saving
More details regarding integration spark streaming and Kafka can be found: https://spark.apache.org/docs/2.4.0/streaming-kafka-0-10-integration.html
I am writing SparkStreaming data to HDFS by converting it to a dataframe:
Code
object KafkaSparkHdfs {
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkKafka")
sparkConf.set("spark.driver.allowMultipleContexts", "true");
val sc = new SparkContext(sparkConf)
def main(args: Array[String]): Unit = {
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val ssc = new StreamingContext(sparkConf, Seconds(20))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "stream3",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("fridaydata")
val stream = KafkaUtils.createDirectStream[String, String](
ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams)
)
val lines = stream.map(consumerRecord => consumerRecord.value)
val words = lines.flatMap(_.split(" "))
val wordMap = words.map(word => (word, 1))
val wordCount = wordMap.reduceByKey(_ + _)
wordCount.foreachRDD(rdd => {
val dataframe = rdd.toDF();
dataframe.write
.mode(SaveMode.Append)
.save("hdfs://localhost:9000/newfile24")
})
ssc.start()
ssc.awaitTermination()
}
}
The folder is created but the file is not written.
The program is getting terminated with the following error:
18/06/22 16:14:41 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:670)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:289)
at java.lang.Thread.run(Thread.java:748)
18/06/22 16:14:41 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:670)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:289)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
In my pom I am using respective dependencies:
spark-core_2.11
spark-sql_2.11
spark-streaming_2.11
spark-streaming-kafka-0-10_2.11
The error is due to trying to run multiple spark contexts at the same time. Setting allowMultipleContexts to true is mostly used for testing purposes and it's use is discouraged. The solution to your problem is therefore to make sure that the same SparkContext is used everywhere. From the code we can see that the SparkContext (sc) is used to create a SQLContext which is fine. However, when creating the StreamingContext it is not used, instead the SparkConf is used.
By looking at the documentation we see:
Create a StreamingContext by providing the configuration necessary for a new SparkContext
In other words, by using SparkConf as parameter a new SparkContext will be created. Now there are two separate contexts.
The easiest solution here would be to continue using the same context as before. Change the line creating the StreamingContext to:
val ssc = new StreamingContext(sc, Seconds(20))
Note: In newer versions of Spark (2.0+) use SparkSession instead. A new streaming context can then be created using StreamingContext(spark.sparkContext, ...). It can look as follows:
val spark = SparkSession().builder
.setMaster("local[*]")
.setAppName("SparkKafka")
.getOrCreate()
import sqlContext.implicits._
val ssc = new StreamingContext(spark.sparkContext, Seconds(20))
There is an obvious problem here - coalesce(1).
dataframe.coalesce(1)
While reducing number of files might be tempting in many scenarios, it should be done if and only if it amount of data is low enough for nodes to handle (clearly it isn't here).
Furthermore, let me quote the documentation:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
The conclusion is you should adjust the parameter accordingly to the expected amount of data and desired parallelism. coalesce(1) as such is rarely useful in practice, especially in a context like streaming, where data properties can differ over time.
I am trying out some hands-on on spark and I tried to use spark streaming to read data from a kafka topic and store that data in a elasticsearch index.
I am trying to run my code from my ide.
I added some messages in Kafka and ran my Kafka Streaming context program and it read the data but after that the program stopped.
So, if I add new data in kafka , again I have to run my streaming context program.
I want the streaming context to keep 'listening' to the kafka broker , and I should not be running it each time I add some messages in Kafka broker.
Here is my code:
val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaStreams2")
val ssc = new StreamingContext(conf, Seconds(1))
val kafkaParams = Map(
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[LongDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "spark-streaming-notes2",
"auto.offset.reset" -> "latest"
)
// List of topics you want to listen for from Kafka
val topics = List("inputstream-sink")
val lines = KafkaUtils.createDirectStream[String, String](ssc,
PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)
val word = lines.map(_.value())
word.print()
insertIntoIndexes()
ssc.start()
ssc.stop(stopSparkContext = false)
I am trying to read older messages from Kafka with spark streaming. However, I am only able to retrieve messages as they are sent in real time (i.e., if I populate new messages, while my spark program is running - then I get those messages).
I am changing my groupID and consumerID to make sure zookeeper isn't just not giving messages it knows my program has seen before.
Assuming spark is seeing the offset in zookeeper as -1, shouldn't it read all the old messages in the queue? Am I just misunderstanding the way a kafka queue can be used? I'm very new to spark and kafka, so I can't rule out that I'm just misunderstanding something.
package com.kibblesandbits
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import net.liftweb.json._
object KafkaStreamingTest {
val cfg = new ConfigLoader().load
val zookeeperHost = cfg.zookeeper.host
val zookeeperPort = cfg.zookeeper.port
val zookeeper_kafka_chroot = cfg.zookeeper.kafka_chroot
implicit val formats = DefaultFormats
def parser(json: String): String = {
return json
}
def main(args : Array[String]) {
val zkQuorum = "test-spark02:9092"
val group = "myGroup99"
val topic = Map("testtopic" -> 1)
val sparkContext = new SparkContext("local[3]", "KafkaConsumer1_New")
val ssc = new StreamingContext(sparkContext, Seconds(3))
val json_stream = KafkaUtils.createStream(ssc, zkQuorum, group, topic)
var gp = json_stream.map(_._2).map(parser)
gp.saveAsTextFiles("/tmp/sparkstreaming/mytest", "json")
ssc.start()
}
When running this, I will see the following message. So I am confident that it's not just not seeing the messages because the offset is set.
14/12/05 13:34:08 INFO ConsumerFetcherManager:
[ConsumerFetcherManager-1417808045047] Added fetcher for partitions
ArrayBuffer([[testtopic,0], initOffset -1 to broker
id:1,host:test-spark02.vpc,port:9092] , [[testtopic,1],
initOffset -1 to broker i d:1,host:test-spark02.vpc,port:9092] ,
[[testtopic,2], initOffset -1 to broker
id:1,host:test-spark02.vpc,port:9092] , [[testtopic,3],
initOffset -1 to broker id:1,host:test-spark02.vpc,port:9092] ,
[[testtopic,4], initOffset -1 to broker
id:1,host:test-spark02.vpc,port:9092] )
Then, if I populate 1000 new messages -- I can see those 1000 messages saved in my temp directory. But I don't know how to read the existing messages, which should number in the (at this point) tens of thousands.
Use the alternative factory method on KafkaUtils that lets you provide a configuration to the Kafka consumer:
def createStream[K: ClassTag, V: ClassTag, U <: Decoder[_]: ClassTag, T <: Decoder[_]: ClassTag](
ssc: StreamingContext,
kafkaParams: Map[String, String],
topics: Map[String, Int],
storageLevel: StorageLevel
): ReceiverInputDStream[(K, V)]
Then build a map with your kafka configuration and add the parameter 'kafka.auto.offset.reset' set to 'smallest':
val kafkaParams = Map[String, String](
"zookeeper.connect" -> zkQuorum, "group.id" -> groupId,
"zookeeper.connection.timeout.ms" -> "10000",
"kafka.auto.offset.reset" -> "smallest"
)
Provide that config to the factory method above. "kafka.auto.offset.reset" -> "smallest" tells the consumer to starts from the smallest offset in your topic.