Get max,min of offset from kafka dataframe - apache-spark

Below is how I'm reading data from kafka.
val inputDf = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", topic)
.option("startingOffsets", """{"topic1":{"1":-1}}""")
.load()
val df = inputDf.selectExpr("CAST(value AS STRING)","CAST(topic AS STRING)","CAST (partition AS INT)","CAST (offset AS INT)","CAST (timestamp AS STRING)")
How can I get the max & min offsets and timestamp from above dataframe? I want to save it to some external source for future reference.I cannot use 'agg' function as i'm writing same dataframe to writestream(as shown below)
val kafkaOutput = df.writeStream
.outputMode("append")
.option("path", "/warehouse/download/data1")
.format("console")
.option("checkpointLocation", checkpoint_loc)
.start()
.awaitTermination()

If you can upgrade your Spark version to 2.4.0 you will be able to solve this issue.
In Spark 2.4.0, you have spark foreachbatch api through which you can write the same DataFrame to multiple sinks.
Spark.writestream.foreachbatch((batchDF, batchId) => some_fun(batchDF)).start()
some_fun(batchDF): { persist the DF & perform the aggregation}

Related

How to read multiple Kafka topics with Spark into seperate Dataframes

I am publishing data to 2 afka topics name "akr" and "akr2". How can I read them in separate dataframes?
According to the Spark + Kafka Integration Guide and assuming ou plan to process them with Structured Streaming you can define the required two Dataframes as below:
val df1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "akr")
.load()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
val df2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "akr2")
.load()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
The data of the two mentioned topics will be consumed as soon as you have some Spark actions on the Dataframes.

Unable to write results to kafka topic using spark

My end goal is to write out and read the aggregated data to the new Kafka topic in the batches it gets processed. I followed the official documentation and a couple of other posts but no luck. I would first read the topic, perform aggregation, save the results in another Kafka topic, and again read the topic and print it in the console. Below is my code:
package com.sparkKafka
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming._
import scala.concurrent.duration._
object SparkKafkaTopic3 {
def main(ar: Array[String]) {
val spark = SparkSession.builder().appName("SparkKafka").master("local[*]").getOrCreate()
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "songDemo5")
.option("startingOffsets", "earliest")
.load()
import spark.implicits._
df.printSchema()
val newDf = df.select($"value".cast("string"), $"timestamp").select(split(col("value"), ",")(0).as("userName"), split(col("value"), ",")(1).as("songName"), col("timestamp"))
val windowedCount = newDf
.withWatermark("timestamp", "40000 milliseconds")
.groupBy(
window(col("timestamp"), "20 seconds"), col("songName"))
.agg(count(col("songName")).alias("numberOfTimes"))
val outputTopic = windowedCount
.select(struct("*").cast("string").as("value")) // Added this line.
.writeStream
.format("kafka")
.option("topic", "songDemo6")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "/tmp/spark_ss/")
.start()
val finalOutput = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "songDemo6").option("startingOffsets", "earliest")
.load()
.writeStream.format("console")
.outputMode("append").start()
spark.streams.awaitAnyTermination()
}
}
When I run this, in the console initially there is a below exception
java.lang.IllegalStateException: Cannot find earliest offsets of Set(songDemo4-0). Some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the
source option "failOnDataLoss" to "false".
Also, if I try to run this code without writing to the topic part and reading it again everything works fine.
I tried to read the topic from the shell using consumer command but no records are displayed. Is there anything that I am missing over here?
Below is my dataset:
>sid,Believer
>sid,Thunder
>sid,Stairway to heaven
>sid,Heaven
>sid,Heaven
>sid,thunder
>sid,Believer
When I ran #Srinivas's code and after reading the new topic I am getting data as below:
[[2020-06-07 18:18:40, 2020-06-07 18:19:00], Heaven, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Believer, 1]
[[2020-06-07 18:18:40, 2020-06-07 18:19:00], Heaven, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Believer, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Stairway to heaven, 1]
[[2020-06-07 18:40:40, 2020-06-07 18:41:00], Heaven, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Thunder, 1]
Here you can see for Believer the window frame is the same but still, the entries are separate. Why is it so? It should be single entry with count 2 since the window frame is the same
Check below code.
Added this windowedCount.select(struct("*").cast("string").as("value")) before you write anything to kafka you have to convert all columns of type string alias of that column is value
val spark = SparkSession.builder().appName("SparkKafka").master("local[*]").getOrCreate()
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "songDemo")
.option("startingOffsets", "earliest")
.load()
import spark.implicits._
df.printSchema()
val newDf = df.select($"value".cast("string"),$"timestamp").select(split(col("value"), ",")(0).as("userName"), split(col("value"), ",")(1).as("songName"), col("timestamp"))
val windowedCount = newDf
.withWatermark("timestamp", "40000 milliseconds")
.groupBy(
window(col("timestamp"), "20 seconds"), col("songName"))
.agg(count(col("songName")).alias("numberOfTimes"))
val outputTopic = windowedCount
.select(struct("*").cast("string").as("value")) // Added this line.
.writeStream
.format("kafka")
.option("topic", "songDemoA")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "/tmp/spark_ss/")
.start()
val finalOutput = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "songDemoA").option("startingOffsets", "earliest")
.load()
.writeStream.format("console")
.outputMode("append").start()
spark.streams.awaitAnyTermination()
Updated - Ordering Output
val windowedCount = newDf
.withWatermark("timestamp", "40000 milliseconds")
.groupBy(
window(col("timestamp"), "20 seconds"), col("songName"))
.agg(count(col("songName")).alias("numberOfTimes"))
.orderBy($"window.start".asc) // Add this line if you want order.
Ordering or sorting result works only if you use output mode is complete for any other values it will throw an error.
For example check below code.
val outputTopic = windowedCount
.writeStream
.format("console")
.option("truncate","false")
.outputMode("complete")
.start()

Writing multiple streams sequentially in spark structured streaming

I am consuming data from kafka through spark structured streaming and trying to write it to 3 different sources. I want the streams to execute sequentially as the logic(in writer) in stream2(query2) is dependent on stream1(query1). What is happening is the query2 get executed before query1 and my logic breaks.
val inputDf = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("assign"," {\""+topic+"\":[0]}")
.load()
val query1 = inputDf.selectExpr("CAST (partition AS INT)","CAST (offset AS INT)","CAST (timestamp AS STRING)")
df1.agg(min("offset"), max("offset"))
.writeStream
.foreach(writer)
.outputMode("complete")
.trigger(Trigger.ProcessingTime("2 minutes"))
.option("checkpointLocation", checkpoint_loc1).start()
//result= (derived from some processing over 'inputDf' dataframe)
val query2 = result.select(result("eventdate")).distinct
distDates.writeStream.foreach(writer1)
.trigger(Trigger.ProcessingTime("2 minutes"))
.option("checkpointLocation", checkpoint_loc2).start()
val query3 = result.writeStream
.outputMode("append")
.format("orc")
.partitionBy("eventdate")
.option("path", "/warehouse/test_duplicate/download/data1")
.option("checkpointLocation", checkpoint_loc)
.option("maxRecordsPerFile", 999999999)
.trigger(Trigger.ProcessingTime("2 minutes"))
.start()
spark.streams.awaitAnyTermination()
result.checkpoint()

Writing Spark Structured Streaming Output to a Kafka Topic

I have a simple structured streaming application that just reads data from one Kafka topic and writes to another.
SparkConf conf = new SparkConf()
.setMaster("local[*]")
.setAppName("test");
SparkSession spark = SparkSession
.builder()
.config(conf)
.getOrCreate();
Dataset<Row> dataset = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "start")
.load();
StreamingQuery query = dataset
.writeStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "checkpoint")
.option("topic", "end")
.start();
query.awaitTermination(20000);
There are two messages to be processed on the topic start. This code runs without exception, however no messages ever end up on the topic end. What is wrong with this example?
The problem is that the messages were already on the stream and the starting offset was not set to "earliest".
Dataset<Row> dataset = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", start.getTopicName())
.option("startingOffsets", "earliest")
.load();

Spark structured streaming - 2 ReadStreams in one app

Is it possible to have two separate ReadStreams in one app? I'm trying to listen to two separate Kafka topics and do calculations based on both DataFrames.
You could simply subscribe to multiple topics:
// Subscribe to multiple topics
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1,topic2")
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
Or, if you specifically want to use two separated readStream definitions within one app:
// read stream A
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
// read stream B
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic2")
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
You should be able to achieve this by using join() in Spark 2.3.0:
val stream1 = spark.readStream. ...
val stream2 = spark.readStream. ...
val joinedDf = stream1.join(stream2, "join_column_id")

Resources