Print RDD out to console in spark streaming - apache-spark

I wrote a Spark Streaming application that receives data from Kafka using KafkaUtils, and what I want to do is print out the data I receive from Kafka. Here is my code (I use spark-submit to execute my Spark Streaming job):
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
messages.print()
When I run this, it works fine. If the input in the Kafka producer is a, b, c, I get the following result from Spark Streaming:
-------------------------------------------
Time: 1476481700000 ms
-------------------------------------------
(null,a)
(null,b)
(null,c)
But if I add one line to count the number of lines, messages.print() no longer works. The code is shown below:
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
messages.print()
messages.count().print()
I am getting the following result:
-------------------------------------------
Time: 1476481800000 ms
-------------------------------------------
4
Only the count is printed; the data itself is not printed.
My question is why messages.print() is not executed after I add messages.count().print().
Another question is what null stands for in the tuples (null,a), (null,b), (null,c).

There is no issue with print(); it will print both the messages and the count, as shown below. Scroll through and check your log.
-------------------------------------------
Time: 1476481700000 ms
-------------------------------------------
(null,a)
(null,b)
(null,c)
-------------------------------------------
Time: 1476481800000 ms
-------------------------------------------
4
The KafkaUtils.createDirectStream method returns a DStream of (Kafka message key, Kafka message value) pairs, so the first element of each tuple is the message key, which is null here because the producer did not set one.
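If you only care about the payload, you can map away the null key before printing; a minimal sketch against the same stream:
// Keep only the message value, i.e. the second element of each (key, value) pair.
val values = messages.map(_._2)
values.print()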

Your code should be working, but here is an alternative. Note that this approach is only meant for testing or learning. Instead of performing two actions, you can achieve the same end goal with a single action:
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
// Cache the stream before you perform any heavyweight operations on it.
messages.cache()
// A DStream has no collect(); collect each batch's RDD inside foreachRDD instead.
// collect() brings the whole batch to the driver, hence "testing or learning only".
messages.foreachRDD { rdd =>
  val result = rdd.collect()
  println(result.size + " size")
  result.foreach { input => println(input) }
}

Related

Issue with HBase in spark streaming

I have a performance issue when reading data from HBase in Spark Streaming: it takes more than 5 minutes just to read data from HBase for 3 records. Below is the logic I used in mapPartitions.
val messages = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, topicSet)
messages.mapPartitions(iter => {
  val context = TaskContext.get
  logger.info(s"Process for partition: ${context.partitionId} ")
  val hbaseConf = HBaseConfiguration.create()
  //hbaseConf.addResource(new File("/etc/hbase/conf/hbase-site.xml").toURI.toURL)
  //val connection: Connection = hbaseConnection.getOrCreateConnection(hbaseConf)
  val connection = ConnectionFactory.createConnection(hbaseConf)
  val hbaseTable = connection.getTable(TableName.valueOf("prod:CustomerData"))
  .......
})
I have used BulkGet. It takes around 5 seconds to process 90K messages (maybe because the API uses HBaseContext and we don't have to create any HBaseConnection). But I cannot use this, as the output of BulkGet is an RDD, and I have to do a leftOuterJoin to join the BulkGet RDD with the actual RDD from Kafka. I assume this is not the correct approach as it involves the steps below. Moreover, I have to process all 90K messages in 1 second.
Fetch the distinct customer IDs from the RDD read from Kafka before passing it to BulkGet
Also, it involves shuffling, as I have to leftOuterJoin the main RDD (from Kafka) with the BulkGet RDD (I only see the option of a join, as the BulkGet output is an RDD)
Can anyone please help me understand what the performance issue is when I try to create an HBaseConnection in mapPartitions? I have also tried setting driver-class-path.
Thanks
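One likely contributor to the slowness is that a new HBase Connection is created for every partition of every micro-batch inside mapPartitions. A hedged sketch (not from the original post) of the common workaround of holding one connection per executor JVM in a singleton object, assuming the same HBase client API as above:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

// A singleton object is instantiated once per executor JVM, so the expensive
// connection is created once and reused across partitions and micro-batches.
object HBaseConnectionHolder {
  lazy val connection: Connection =
    ConnectionFactory.createConnection(HBaseConfiguration.create())
}

messages.mapPartitions { iter =>
  // Table handles are cheap to obtain; only the Connection needs to be shared.
  val table = HBaseConnectionHolder.connection.getTable(TableName.valueOf("prod:CustomerData"))
  val results = iter.map { record =>
    // ... per-record HBase lookup / enrichment logic goes here (elided in the original) ...
    record
  }.toList
  table.close()
  results.iterator
}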

Spark Structured Streaming Kafka Microbatch count

I am using Spark Structured Streaming to read records from a Kafka topic; I intend to count the number of records received in each micro-batch of the Spark readStream.
This is a snippet:
val kafka_df = sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribe", "test-count")
.load()
I understand from the docs that kafka_df will be lazily evaluated when a streamingQuery is started (to come next), and as it is evaluated, it holds a micro-batch. So, I figured doing a groupBy on topic followed by a count should work.
Like this:
val counter = kafka_df
.groupBy("topic")
.count()
Now, to evaluate all of this, we need a streamingQuery, let's say a console sink query, to print it on the console. And this is where I see the problem. A streamingQuery on an aggregated DataFrame (such as counter above) works only with outputMode complete/update and not with append.
This effectively means that the count reported by the streamingQuery is cumulative.
Like this:
val counter_json = counter.toJSON //to jsonify
val count_query = counter_json
.writeStream.outputMode("update")
.format("console")
.start() // kicks off lazy evaluation
.awaitTermination()
In a controlled setup, where:
actual published records: 1500
actual received micro-batches: 3
actual received records: 1500
The count of each microbatch is supposed to be 500, so I hoped (wished) that the query prints to console:
topic: test-count
count: 500
topic: test-count
count: 500
topic: test-count
count: 500
But it doesn't. It actually prints:
topic: test-count
count: 500
topic: test-count
count: 1000
topic: test-count
count: 1500
This, I understand, is because of the outputMode complete/update (cumulative).
My question: is it possible to accurately get the count of each micro-batch in Spark-Kafka Structured Streaming?
From the docs, I found out about the watermark approach (to support append):
val windowedCounts = kafka_df
.withWatermark("timestamp", "10 seconds")
.groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"topic")
.count()
val console_query = windowedCounts
.writeStream
.outputMode("append")
.format("console")
.start()
.awaitTermination()
But the results of this console_query are inaccurate and appear to be way off the mark.
TL;DR - Any thoughts on accurately counting the records in Spark-Kafka micro-batch would be appreciated.
If you want to only process a specific number of records with every trigger within a Structured Streaming application using Kafka, use the option maxOffsetsPerTrigger
val kafka_df = sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribe", "test-count")
.option("maxOffsetsPerTrigger", 500)
.load()
"TL;DR - Any thoughts on accurately counting the records in Spark-Kafka micro-batch would be appreciated."
You can count the records fetched from Kafka by using a StreamingQueryListener (ScalaDocs).
This allows you to print out the exact number of rows that were received from the subscribed Kafka topic. The onQueryProgress API gets called during every micro-batch and contains lots of useful meta information on your query. If no data is flowing into the query the onQueryProgress is called every 10 seconds. Below is a simple example that prints out the number of input messages.
spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
    println("NumInputRows: " + queryProgress.progress.numInputRows)
  }
})
In case you are validating the performance of your Structured Streaming query, it is usually best to keep an eye on the following two metrics:
queryProgress.progress.inputRowsPerSecond
queryProgress.progress.processedRowsPerSecond
If the input rate is higher than the processed rate, you might increase the resources for your job or reduce the maximum limit (by lowering the readStream option maxOffsetsPerTrigger). If the processed rate is higher, you may want to increase this limit.
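As a minimal sketch building on the listener above (registered before the query is started so that no progress events are missed), the two rates can be compared directly inside onQueryProgress; the threshold logic here is illustrative only:
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val in  = event.progress.inputRowsPerSecond
    val out = event.progress.processedRowsPerSecond
    // If rows arrive faster than they are processed, the query is falling behind:
    // add resources or lower maxOffsetsPerTrigger; otherwise the limit can be raised.
    if (in > out) println(s"Falling behind: $in rows/s in vs. $out rows/s processed")
  }
})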

Data loss with Spark 2.1 streaming and Kafka broker 0.8.2.1

I am using Spark 2.1 streaming and Kafka broker version 0.8.2.1, with separate servers for Spark and Kafka on AWS.
I am using the direct approach (val directKafkaStream = KafkaUtils.createDirectStream) with StreamingContext(conf, Seconds(300)). I expect to get 30 strings per batch from the stream, but I actually receive only 15-25. A cross-check with a Kafka consumer on the same topic shows 30 strings during the same 300 seconds, and stream.foreachRDD { rdd => ... } gives 15 to 20 strings.
What is causing this data loss? I am using SparkSession to create sc and ssc.
Thank You.
Add auto.offset.reset set to smallest in the Kafka params:
val kafkaParams = Map[String, String](
"auto.offset.reset" -> "smallest", ......)

How to save data to process later after stopping DirectStream in SparkStreaming?

I am creating the KafkaDirectStream below.
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
Then saving the values as:
val lines = messages.map(_.value)
Then stopping the streaming context when I have no further offsets to consume, as follows:
lines.foreachRDD(rdd => {
  if (rdd.isEmpty()) {
    messages.stop()
    ssc.stop(false)
  } else {
  }
})
Then I am printing the lines as follows:
lines.print()
Then I am starting the stream as:
ssc.start()
It is working fine. It reads the RDDs, prints the top 10, stops the messages stream, and stops the streaming context. But then, when I execute the same line lines.print(), it throws an exception saying that new inputs, transformations, or outputs cannot be added after stopping the StreamingContext.
How do I achieve my goal? I am running it in spark-shell, not as a binary (a mandatory requirement).
Here is what I actually want to achieve:
1) Consume all json records from the kafka topic.
2) Stop getting further records (it is guaranteed that after consuming, no new records will be added to the Kafka topic, so I don't want to keep processing an empty stream).
3) Do some preprocessing by extracting some fields from the JSON fields.
4) Do further operation on the preprocessed data.
5) Done.
When you call "lines.print()" again, it tries to call the transformation "messages.map(_.value)" again. As you have stopped the context, it fails.
Save the data in the lines variable by performing an action before stopping the context.
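A minimal sketch of that suggestion, assuming the batches are small enough to collect to the driver (for larger volumes, an action such as rdd.saveAsTextFile inside foreachRDD would be the safer choice); the buffer name is hypothetical:
import scala.collection.mutable.ArrayBuffer

// The buffer lives on the driver; foreachRDD's body also runs on the driver,
// so appending each collected batch here is safe.
val consumed = ArrayBuffer[String]()

lines.foreachRDD { rdd =>
  if (rdd.isEmpty()) {
    ssc.stop(false)            // no more offsets to consume
  } else {
    consumed ++= rdd.collect() // materialise the batch before the context stops
  }
}

ssc.start()
ssc.awaitTermination()
// After the context has stopped, `consumed` still holds the JSON records and can be
// preprocessed with plain Scala (or turned back into an RDD via sc.parallelize).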

Why does my Spark Streaming application not print the number of records from Kafka (using count operator)?

I am working on a Spark application which needs to read data from Kafka. I created a Kafka topic where a producer was posting messages. I verified from a console consumer that the messages were posted successfully.
I wrote a short Spark application to read data from Kafka, but it is not getting any data.
Following is the code I used:
def main(args: Array[String]): Unit = {
  val Array(zkQuorum, group, topics, numThreads) = args
  val sparkConf = new SparkConf().setAppName("SparkConsumer").setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
  val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
  process(lines) // prints the number of records in Kafka topic
  ssc.start()
  ssc.awaitTermination()
}

private def process(lines: DStream[String]) {
  val z = lines.count()
  println("count of lines is " + z)
  //edit
  lines.foreachRDD(rdd => rdd.map(println)
    // <-- Why does this **not** print?
  )
}
Any suggestions on how to resolve this issue?
**** EDIT ****
I have used
lines.foreachRDD(rdd => rdd.map(println))
in the actual code as well, but that is also not working. I set the retention period as mentioned in the post Kafka spark directStream can not get data, but the problem still exists.
Your process is a continuation of a DStream pipeline with no output operator, so nothing gets the pipeline executed every batch interval.
You can "see" it by reading the signature of count operator:
count(): DStream[Long]
Quoting the count's scaladoc:
Returns a new DStream in which each RDD has a single element generated by counting each RDD of this DStream.
So, you have a dstream of Kafka records that you transform into a dstream of single values (being the result of count). There is not much use in that until you have it output (to the console or any other sink).
You have to end the pipeline using an output operator as described in the official documentation Output Operations on DStreams:
Output operations allow DStream’s data to be pushed out to external systems like a database or a file systems. Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
(Low-level) Output operators register input dstreams as output dstreams so the execution can start. Spark Streaming's DStream by design has no notion of being an output dstream. It is the DStreamGraph that knows about, and is able to differentiate between, input and output dstreams.
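A minimal sketch of how the process method from the question could end with output operators; note that println inside rdd.map is lazy and would run on the executors anyway, so here a small sample is collected to the driver with take before printing:
import org.apache.spark.streaming.dstream.DStream

private def process(lines: DStream[String]): Unit = {
  // count() is a transformation; print() is the output operator that triggers it.
  lines.count().print()
  // Print a sample of the records on the driver for every batch.
  lines.foreachRDD { rdd =>
    rdd.take(10).foreach(println)
  }
}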