Spark streaming NetworkWordCount example creates multiple jobs per batch - apache-spark

I am running the basic NetworkWordCount program on yarn cluster through spark-shell. Here is my code snippet -
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
val ssc = new StreamingContext(sc, Seconds(60))
val lines = ssc.socketTextStream("172.26.32.34", 9999, StorageLevel.MEMORY_ONLY)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
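(For context, this assumes a plain text server is already listening on 172.26.32.34:9999; in the standard NetworkWordCount walkthrough that server is typically netcat, started with nc -lk 9999.)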
The output on the console and the stats on the Streaming tab are as expected.
But when I look at the Jobs tab, two jobs get triggered per 1-minute batch interval. Shouldn't it be one job per interval? Screenshot below:
Yet when I look at the completed batches on the Streaming tab, I see exactly one batch per minute. Screenshot below:
Am I missing something? I also noticed that the start job has two stages with the same name that spawn different numbers of tasks, as seen in the image below. What exactly is happening here?

Related

foreachRDD sometimes takes too long between batches

I have a problem; we are using Kafka and Spark.
val ssc = new StreamingContext(conf, Seconds(10))
val messages = KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[K, V](config.topics, scala.collection.Map[String, Object](kafkaParams.toSeq: _*), offsetRange))
messages.foreachRDD {(rdd, time) => ...}
It works well, but sometimes a new batch only starts about 10 minutes after the previous one. The times are measured from log messages.
Why is that happening?
I've found the reason; it was due to issues.apache.org/jira/browse/KAFKA-12890

Spark mapPartitions Issue

I am using Spark mapPartitions on my DataFrame, and the use case is that I should submit one job (either calling a Lambda or sending an SQS message) for each partition.
I am partitioning on a custom formatted date column and logging the number of partitions before and after, and that is working as expected.
However, when I look at the total number of jobs, there are more jobs than partitions. For some of the partitions there are two or three jobs!
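As an aside, here is a minimal sketch of how the before/after partition counts can be logged; the column name year_month is hypothetical and repartition(col(...)) is just one way the custom date-based partitioning might be done:

import org.apache.spark.sql.functions.col

// hypothetical: repartition on the formatted date column and compare partition counts
val before = yearMonthQueryDF.rdd.getNumPartitions
val repartitionedDF = yearMonthQueryDF.repartition(col("year_month")) // "year_month" is a hypothetical column name
val after = repartitionedDF.rdd.getNumPartitions
logger.info(s"partitions before = [$before], after = [$after]")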
Here is the code I am using:
val yearMonthQueryRDD = yearMonthQueryDF.rdd.mapPartitions(
  partition => {
    val partitionObjectList = new java.util.ArrayList[String]()
    logger.info("partitionIndex = {}", TaskContext.getPartitionId());

    val partitionCounter: AtomicLong = new AtomicLong(0)
    val partitionSize: AtomicLong = new AtomicLong(0)
    val partitionColumnName: AtomicReference[String] = new AtomicReference[String]();

    // Iterate over the records in the given partition
    val updatedPartition = partition.map(record => {
      import yearMonthQueryDF.sparkSession.implicits._
      partitionCounter.set(partitionCounter.get() + 1)
      val recordSizeInt = Integer.parseInt(record.getAs("object_size"))
      val recordSize: Long = recordSizeInt.toLong
      partitionSize.set(partitionSize.get() + recordSize) // accumulate the total size of the partition
      partitionObjectList.add(record.getAs("object_key"))
      partitionColumnName.set(record.getAs("partition_column_name"))
      record
    }).toList

    logger_ref.info("No. of elements in partition [" + partitionColumnName.get() + "] are = [" + partitionCounter.get() + "] Total size = [" + partitionSize.get() + "]")
    // Submit a job for the partition
    // jobUtil.submitJob(partitionColumnName.get(), partitionObjectList, partitionSize.get())
    updatedPartition.toIterator
  }
)
Another thing that is making the debugging harder is that the logging statements inside the mapPartitions() method are not found in the container error logs. Since they are executed on each worker node rather than on the master node, I expected to find them in the container logs rather than in the master node logs. (I also need to figure out why I am only seeing stderr logs but not stdout logs on the containers.)
Thanks
Sateesh

Spark structured streaming process each row on different worker nodes as soon as it arrives

Using Spark 2.3 structured streaming and Kafka as the input stream.
My cluster is built from a master and 3 workers (the master runs on one of the worker machines).
My Kafka topic has 3 partitions, matching the number of workers.
I am using the default trigger and a foreach sink to process the data.
When the first message arrives at the driver, processing immediately starts on one of the available worker nodes. While that is running, a second message arrives; instead of immediately being processed on an available worker, its execution is delayed until the first worker finishes, and only then do all of the "waiting executions" start processing in parallel on all the available workers (let's say I have 3 waiting messages).
How can I force the execution to start immediately on a waiting worker?
A snippet of my code:
val sparkSession = SparkSession.builder().config(conf).getOrCreate()
import sparkSession.implicits._
import org.apache.spark.sql.ForeachWriter

val writer = new ForeachWriter[String] {
  override def open(partitionId: Long, version: Long) = true
  override def process(filePath: String) = {
    val filesSeq = fileHandler
      .handleData(filePath) // long processing
  }
  override def close(errorOrNull: Throwable) = {}
}

val filesDf = kafkaStreamSubscriber
  .buildtream(conf, kafkaInputTopic)

val ds = filesDf.map(x => x.getAs("filePath").asInstanceOf[String])

val query =
  ds.writeStream
    .foreach(writer)
    .start

ds.writeStream
  .format("console")
  .option("truncate", "false")
  .start()

println("lets go....")
query.awaitTermination()
What am I doing wrong? I don't want to have idle workers when there is waiting data to process.
Thanks
Refer to the Triggers section of the Spark Structured Streaming documentation.
As far as I understand, the default trigger processes one micro-batch at a time. I would suggest considering the experimental Continuous mode if you need to process data as soon as it arrives.
My understanding is that if you use a trigger of, say, 5 seconds, the micro-batch will read messages from all 3 partitions and you will have 3 tasks running at the same time. Until they have all finished, no new micro-batch will be started.
Hope it helps!
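For illustration, a minimal sketch of how the trigger can be set on the query from the snippet above (ds and writer are taken from it); Trigger.Continuous is the experimental continuous mode mentioned above, and which sinks support it depends on the Spark version:

import org.apache.spark.sql.streaming.Trigger

// default micro-batch behaviour: a new batch starts as soon as the previous one finishes
val microBatchQuery = ds.writeStream
  .foreach(writer)
  .start()

// experimental continuous processing (Spark 2.3+): records are processed as they arrive,
// with checkpoints written at the given interval; sink support is limited (console is used here)
val continuousQuery = ds.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()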

Apache Spark or Spark-Cassandra-Connector doesn't seem to be reading multiple partitions in parallel?

Apache Spark or the Spark-Cassandra-Connector doesn't seem to be reading multiple partitions in parallel.
Here is my code using spark-shell
import org.apache.spark.sql._
import org.apache.spark.sql.types.StringType
spark.sql("""CREATE TEMPORARY VIEW hello USING org.apache.spark.sql.cassandra OPTIONS (table "hello", keyspace "db", cluster "Test Cluster", pushdown "true")""")
val df = spark.sql("SELECT test from hello")
val df2 = df.select(df("test").cast(StringType).as("test"))
val rdd = df2.rdd.map { case Row(j: String) => j }
val df4 = spark.read.json(rdd) // This line takes forever
I have about 700 million rows, each row about 1 KB, and this line
val df4 = spark.read.json(rdd) takes forever, as I get the following output:
[Stage 1:==========> (4866 + 24) / 25256]
so at this rate it will probably take roughly 3 hours.
I measured the network throughput of the Spark worker nodes using iftop and it is about 75 MB/s (megabytes per second), which is pretty good, but I am not sure whether it is reading partitions in parallel. Any ideas on how to make it faster?
Here is my DAG.
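Not part of the original thread, but one common reason spark.read.json(rdd) is slow is that, without a schema, Spark has to make an extra pass over the whole dataset just to infer one. A minimal sketch, assuming the JSON structure is known in advance (the field names below are hypothetical), of supplying the schema explicitly so the data is only scanned once:

import org.apache.spark.sql.types._

// hypothetical schema describing the JSON documents stored in the "test" column
val jsonSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("payload", StringType)
))

// with an explicit schema, Spark skips the schema-inference pass over all ~700 million rows
val df4 = spark.read.schema(jsonSchema).json(rdd)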

Why does my Spark Streaming application not print the number of records from Kafka (using count operator)?

I am working on a Spark application which needs to read data from Kafka. I created a Kafka topic to which a producer was posting messages, and I verified from the console consumer that the messages were successfully posted.
I wrote a short Spark application to read data from Kafka, but it is not getting any data.
Following is the code I used:
def main(args: Array[String]): Unit = {
  val Array(zkQuorum, group, topics, numThreads) = args
  val sparkConf = new SparkConf().setAppName("SparkConsumer").setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
  val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
  process(lines) // prints the number of records in the Kafka topic
  ssc.start()
  ssc.awaitTermination()
}

private def process(lines: DStream[String]) {
  val z = lines.count()
  println("count of lines is " + z)
  //edit
  lines.foreachRDD(rdd => rdd.map(println)
    // <-- Why does this not print?
  )
}
Any suggestions on how to resolve this issue?
EDIT
I have used
lines.foreachRDD(rdd => rdd.map(println))
as well in the actual code, but that is also not working. I set the retention period as mentioned in the post Kafka spark directStream can not get data, but the problem still exists.
Your process is a continuation of a DStream pipeline with no output operator, which is what gets the pipeline executed every batch interval.
You can "see" it by reading the signature of count operator:
count(): DStream[Long]
Quoting the count's scaladoc:
Returns a new DStream in which each RDD has a single element generated by counting each RDD of this DStream.
So you have a dstream of Kafka records that you transform into a dstream of single values (the result of count), but nothing to have it output (to the console or any other sink).
You have to end the pipeline using an output operator as described in the official documentation Output Operations on DStreams:
Output operations allow DStream’s data to be pushed out to external systems like a database or a file systems. Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
(Low-level) Output operators register input dstreams as output dstreams so the execution can start. Spark Streaming's DStream by design has no notion of being an output dstream; it is the DStreamGraph that knows about, and can differentiate between, input and output dstreams.
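As a sketch of what ending the pipeline can look like here (using the lines dstream from the question), either the built-in print() output operation or foreachRDD combined with an RDD action will do:

// print() is an output operation, so it forces each batch to execute and
// prints the single-element RDD produced by count()
lines.count().print()

// alternatively, use foreachRDD with an RDD action (count/collect/foreach);
// rdd.map(println) is a lazy transformation, which is why the original snippet printed nothing
lines.foreachRDD { rdd =>
  println(s"count of lines is ${rdd.count()}")
  rdd.foreach(println) // prints to the driver console in local mode, to executor stdout on a cluster
}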
