foreachRDD sometimes takes too long between batches - apache-spark

I have a problem. We are using Kafka and Spark Streaming:
val ssc = new StreamingContext(conf, Seconds(10))
val messages = KafkaUtils.createDirectStream(
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[K, V](
    config.topics,
    scala.collection.Map[String, Object](kafkaParams.toSeq: _*),
    offsetRange))
messages.foreachRDD { (rdd, time) => ... }
It works well, but sometimes a new batch only starts about 10 minutes after the previous one. The times are measured from log messages.
Why is that happening?
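As an aside, batch timings can also be taken from Spark's own StreamingListener rather than from application log messages. A minimal sketch (not from the original post), reusing the ssc from the snippet above:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Log per-batch delays: a large schedulingDelay means the batch waited in the queue
// (for example behind a long-running previous batch) before it could start.
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"batch ${info.batchTime}: " +
      s"schedulingDelay=${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processingDelay=${info.processingDelay.getOrElse(-1L)} ms")
  }
})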

I've found the reason: it was due to issues.apache.org/jira/browse/KAFKA-12890

Related

Issue with HBase in spark streaming

I have an issue with performance when reading data from HBase in Spark Streaming. It takes more than 5 minutes just to read 3 records from HBase. Below is the logic I used in mapPartitions.
val messages = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, topicSet)
messages.mapPartitions(iter => {
  val context = TaskContext.get
  logger.info(s"Process for partition: ${context.partitionId} ")
  val hbaseConf = HBaseConfiguration.create()
  //hbaseConf.addResource(new File("/etc/hbase/conf/hbase-site.xml").toURI.toURL)
  //val connection: Connection = hbaseConnection.getOrCreateConnection(hbaseConf)
  val connection = ConnectionFactory.createConnection(hbaseConf)
  val hbaseTable = connection.getTable(TableName.valueOf("prod:CustomerData"))
  .......
})
I have used BulkGet. It takes around 5 seconds to process 90K messages (possibly because the API uses HBaseContext and we don't have to create any HBase connection ourselves). But I cannot use this, because the output of BulkGet is an RDD and I have to do a leftOuterJoin of the BulkGet RDD with the actual RDD from Kafka. I assume this is not the correct approach, as it involves the steps below. Moreover, I have to process all 90K messages in 1 second.
Fetch distinct customer IDs from the RDD read from Kafka before passing it to BulkGet.
It also involves shuffling, as I have to leftOuterJoin the main RDD (from Kafka) with the BulkGet RDD (I only see the option of a join since the BulkGet output is an RDD).
Can anyone please help me understand the performance issue when I create an HBase connection inside mapPartitions? I have also tried setting driver-class-path.
Thanks
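One common pattern for a situation like this is to create the HBase connection once per executor JVM and reuse it across partitions and batches, instead of calling ConnectionFactory.createConnection inside every mapPartitions invocation. A minimal sketch (not from the original post), assuming the standard HBase client API; the HBaseConnectionHolder object below is illustrative:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

// One lazily created HBase connection per executor JVM, shared across tasks.
object HBaseConnectionHolder {
  private var connection: Connection = _

  def get(): Connection = synchronized {
    if (connection == null || connection.isClosed) {
      connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    }
    connection
  }
}

// Inside mapPartitions, reuse the shared connection instead of creating a new one per partition:
messages.mapPartitions { iter =>
  val hbaseTable = HBaseConnectionHolder.get().getTable(TableName.valueOf("prod:CustomerData"))
  // ... issue Gets for the records in this partition, then close the table (not the connection) ...
  iter
}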

Spark streaming NetworkWordCount example creates multiple jobs per batch

I am running the basic NetworkWordCount program on yarn cluster through spark-shell. Here is my code snippet -
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
val ssc = new StreamingContext(sc, Seconds(60))
val lines = ssc.socketTextStream("172.26.32.34", 9999, StorageLevel.MEMORY_ONLY)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
The console output and the stats on the Streaming tab are as expected.
But when I look at the Jobs tab, two jobs get triggered per 1-minute batch interval; shouldn't it be one job per interval? (The UI screenshots from the original post are not reproduced here.)
When I look at completed batches on the Streaming UI, I see exactly one batch per minute.
Am I missing something? I also noticed that the start job has two stages with the same name that spawn different numbers of tasks. What exactly is happening here?

Kafka Spark Streaming multiple aggregation

I am using the Spark Kafka integration 0.10 and I need two levels of aggregation on the stream:
The first one is over a one-minute interval.
The other is a sum over a 15-minute interval.
I would also prefer to accumulate the one-minute values and then reset them when the 15 minutes are over, because the 15-minute values should be persisted.
Having two reduceByKeyAndWindow calls on different sliding windows does not work; it gives a ConcurrentModificationException from the Kafka consumer.
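For reference, a minimal sketch (mine, not from the question) of the two-window setup described above. It assumes a hypothetical keyed stream keyedStream: DStream[(String, Long)] derived from the Kafka DStream, and a batch interval that evenly divides both windows:

import org.apache.spark.streaming.Minutes

// Per-minute totals, emitted once per minute.
val perMinute = keyedStream.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b,
  windowDuration = Minutes(1),
  slideDuration = Minutes(1))

// 15-minute totals, emitted once every 15 minutes (the values to persist).
val perQuarterHour = keyedStream.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b,
  windowDuration = Minutes(15),
  slideDuration = Minutes(15))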
tl;dr It seems to work. Please provide an example that fails.
I am using Spark 2.0.2 (that was released today).
My example is as follows (with some code removed for brevity):
val ssc = new StreamingContext(sc, Seconds(10))
import org.apache.spark.streaming.kafka010._
val dstream = KafkaUtils.createDirectStream[String, String](
  ssc,
  preferredHosts,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, offsets))

def reduceFunc(v1: String, v2: String) = s"$v1 + $v2"

dstream.map { r =>
  println(s"value: ${r.value}")
  val Array(key, value) = r.value.split("\\s+")
  println(s">>> key = $key")
  println(s">>> value = $value")
  (key, value)
}.reduceByKeyAndWindow(
  reduceFunc, windowDuration = Seconds(30), slideDuration = Seconds(10))
  .print()

dstream.foreachRDD { rdd =>
  // Get the offset ranges in the RDD
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  for (o <- offsetRanges) {
    println(s"${o.topic} ${o.partition} offsets: ${o.fromOffset} to ${o.untilOffset}")
  }
}
ssc.start
What would you change to see the exception(s) you're experiencing?
The entire project is available as spark-streaming-kafka-direct.

RDD toDF() : Erroneous Behavior

I built a Spark Streaming app that fetches content from a Kafka queue and intends to put the data into a MySQL table after some pre-processing and structuring.
I call the foreachRDD method on the stream. The issue I'm facing is that there is data loss between when I call saveAsTextFile on the RDD and when I write the DataFrame with format("csv"). I can't seem to pinpoint why this is happening.
val ssc = new StreamingContext(spark.sparkContext, Seconds(60))
ssc.checkpoint("checkpoint")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val stream = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
stream.foreachRDD { rdd =>
  rdd.saveAsTextFile("/Users/jarvis/rdds/" + new SimpleDateFormat("hh-mm-ss-dd-MM-yyyy").format(new Date) + "_rdd")
  import spark.implicits._
  val messagesDF = rdd
    .map(_.split("\t"))
    .map(w => Record(w(0), autoTag(w(1), w(4)), w(2), w(3), w(4),
      w(5).substring(w(5).lastIndexOf("http://")), w(6).split("\n")(0)))
    .toDF("recordTS", "tag", "channel_url", "title", "description", "link", "pub_TS")
  messagesDF.write.format("csv").save(dumpPath + new SimpleDateFormat("hh-mm-ss-dd-MM-yyyy").format(new Date) + "_DF")
}
ssc.start()
ssc.awaitTermination()
There is data loss, i.e. many rows don't make it from the RDD to the DataFrame.
There is also duplication: many rows that do reach the DataFrame are replicated many times.
Found the error. There was actually a misunderstanding about the format of the ingested data.
I expected each message to be a single tab-separated record ("\t\t\t..."), so splitting on "\t" alone should have been enough.
However, the actual data was:
"\t\t\t...\n\t\t\t...\n"
i.e. several such records joined by "\n", so the rdd.map(...) operation needed an additional split on every "\n" first.
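A minimal sketch (mine, not from the original answer) of what that extra split could look like, reusing the Record case class and autoTag helper from the question's code and assuming one tab-separated record per "\n"-delimited chunk:

import spark.implicits._

val messagesDF = rdd
  .flatMap(_.split("\n"))   // first split each ingested chunk into its embedded records
  .map(_.split("\t"))       // then split each record into tab-separated fields
  .map(w => Record(w(0), autoTag(w(1), w(4)), w(2), w(3), w(4),
    w(5).substring(w(5).lastIndexOf("http://")), w(6)))   // w(6).split("\n")(0) is no longer needed
  .toDF("recordTS", "tag", "channel_url", "title", "description", "link", "pub_TS")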

Spark streaming: batch interval vs window

I have a Spark Streaming application which consumes Kafka messages, and I want to process all messages from the last 10 minutes together.
It looks like there are two approaches to get the job done:
val ssc = new StreamingContext(new SparkConf(), Minutes(10))
val dstream = ....
and
val ssc = new StreamingContext(new SparkConf(), Seconds(1))
val dstream = ....
dstream.window(Minutes(10), Minutes(10))
and I just want to clarify whether there are any performance differences between them.
