What is the best way to perform multiple filter operations on a Spark streaming DataFrame read from Kafka?

I need to apply multiple filters to a DataFrame read from a Kafka topic and publish the output of each filter to an external system (such as another Kafka topic).
I read kafkaDF like this:
val kafkaDF: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "try.kafka.stream")
  .load()
  .select(col("topic"), expr("cast(value as string) as message"))
  .filter(col("message").isNotNull && col("message") =!= "")
  .select(from_json(col("message"), eventsSchema).as("eventData"))
I can run foreachBatch on this DataFrame and iterate over the list of filters, publishing each filtered result to a Kafka topic, as shown below:
kafkaDF.writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    // List of filters that need to be applied
    filterList.par.foreach(filterString => {
      val filteredDF = batch.filter(filterString)
      // Add some columns.
      // Do some operations based on the different filters
      filteredDF.toJSON.foreach(value => {
        // Publish a message to Kafka
      })
    })
  }
  .trigger(Trigger.ProcessingTime("60 seconds"))
  .start()
But I am not sure this is the best way, given so many iterations. Is there a better approach?

If you plan to write data from one Kafka topic into multiple Kafka topics, you can add a column called "topic" to a single DataFrame before writing it to Kafka. The value in this column defines the topic to which each record is produced, so you can write to as many Kafka topics as required.
Therefore, I would apply your filter logic as a when/otherwise condition or, if it is more complex, as a UDF.
Below is example code that should get you started. Based on the value of the consumed Kafka message, a column called "topic" is created in filteredDf. If value = 1, the record is produced to the topic "out1"; otherwise it is produced to the topic "out2".
val inputDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "try.kafka.stream")
  .option("failOnDataLoss", "false")
  .load()
  .selectExpr("CAST(key AS STRING) as key", "CAST(value AS STRING) as value", "partition", "offset", "timestamp")

// value = 1 goes to "out1", everything else to "out2"
val filteredDf = inputDf.withColumn("topic", when(col("value") === 1, lit("out1")).otherwise(lit("out2")))

val query = filteredDf
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "/home/michael/sparkCheckpoint/1/")
  .start()
EDIT: (I might have misunderstood your question initially)
If you just want a good way to apply all the filters in your filterList, you can combine them using foldLeft:
val filter1 = col("value") === 1
val filter2 = col("key") === 1
val filterList = List(filter1, filter2)
val filterAll = filterList.tail.foldLeft(filterList.head)((f1, f2) => f1.and(f2))
This produces the combined condition ((value = 1) AND (key = 1)).
Then apply .filter(filterAll) to your DataFrame.
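One caveat with the fold above: filterList.head and filterList.tail throw when the list is empty. The following is a minimal sketch of the same AND-combination on plain Scala predicates, with no Spark involved; Record, Pred and combineAll are hypothetical names used only for illustration. It seeds the fold with an always-true predicate so that an empty list is harmless. With Spark Columns the analogous seed would be lit(true).

```scala
object CombineFilters {
  // Hypothetical stand-ins for a row and a filter condition.
  type Record = Map[String, Int]
  type Pred = Record => Boolean

  // AND all predicates together; an empty list yields "always true".
  def combineAll(filters: List[Pred]): Pred =
    filters.foldLeft[Pred](_ => true)((f1, f2) => r => f1(r) && f2(r))

  def main(args: Array[String]): Unit = {
    val filter1: Pred = r => r("value") == 1
    val filter2: Pred = r => r("key") == 1
    val filterAll = combineAll(List(filter1, filter2))

    println(filterAll(Map("value" -> 1, "key" -> 1)))       // both conditions hold
    println(filterAll(Map("value" -> 1, "key" -> 2)))       // second condition fails
    println(combineAll(Nil)(Map("value" -> 0, "key" -> 0))) // empty list accepts everything
  }
}
```

Seeding the fold this way trades a runtime exception for "no filter applied", which may or may not be what you want for an empty filterList.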


How to calculate moving average in spark structured streaming?

I am trying to calculate a moving average in Spark Structured Streaming over a number of preceding rows rather than over an event-time window.
Kafka holds string messages whose fields (item, value, timestamp) are separated by "#", and I read them with this code:
Dataset<Row> lines = sparkSession.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "users")
    .load()
    .selectExpr("CAST(value AS STRING)")
    .map((MapFunction<Row, Row>) row -> {
        String message = row.getAs("value");
        String[] newRow = message.split("#");
        return RowFactory.create(newRow);
    }, RowEncoder.apply(structType))
    .selectExpr("CAST(item AS STRING)", "CAST(value AS DOUBLE)", "CAST(timestamp AS TIMESTAMP)");
The code above reads the stream from Kafka and transforms the string messages into rows.
When I try something like this:
WindowSpec threeRowWindow = Window.partitionBy("item").orderBy("timestamp").rowsBetween(-3, Window.currentRow());
Dataset<Row> testWindow =
    lines.withColumn("avg", functions.avg("value").over(threeRowWindow));
I get this error:
org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;
Is there any other way to calculate the moving average as each message arrives, updating it as new data comes in from the stream? Or are non-time-based operations simply not supported in Spark Structured Streaming?
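To make the target computation concrete, here is a minimal sketch in plain Scala, with no Spark involved, of a per-item average over the current row and up to N preceding rows in arrival order. It only shows the bookkeeping: keep the last few values per item and update them as each message arrives. Event, update and movingAverages are hypothetical names, and how such per-key state would be wired into a streaming query is left open here and is not confirmed by this thread.

```scala
object RowMovingAverage {
  // Hypothetical message shape: only the fields needed for the average.
  final case class Event(item: String, value: Double)

  // Append the new value to the item's window, keep the current row plus
  // at most `preceding` earlier rows, and return the updated state and average.
  def update(state: Map[String, Vector[Double]], e: Event, preceding: Int)
      : (Map[String, Vector[Double]], Double) = {
    val window = (state.getOrElse(e.item, Vector.empty[Double]) :+ e.value).takeRight(preceding + 1)
    (state.updated(e.item, window), window.sum / window.size)
  }

  // Process events in arrival order and emit one average per event.
  def movingAverages(events: Seq[Event], preceding: Int): Seq[Double] = {
    val start = (Map.empty[String, Vector[Double]], Vector.empty[Double])
    val (_, out) = events.foldLeft(start) { case ((state, acc), e) =>
      val (next, avg) = update(state, e, preceding)
      (next, acc :+ avg)
    }
    out
  }
}
```

For example, RowMovingAverage.movingAverages(events, 3) averages each value with up to three earlier values of the same item, which is the frame the question is after.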

Read whole Kafka topic as spark dataframe in offsets batches

I am trying to read all the data in a Kafka topic in batches (each batch covering a range between two offsets) and load them into Spark DataFrames, without using readStream.
My idea is:
I first get the total number of records in the topic by finding the maximum offset value.
I define step, the number of records per batch.
With a for loop, I read each batch from the Kafka topic by setting the startingOffsets and endingOffsets parameters.
This is my code (for a topic with a single partition) to print the count in each batch:
import scala.sys.process._

val maxOffsetValue: Int = {
  // Assumed parsing (missing in the source): GetOffsetShell prints topic:partition:offset
  Process(s"kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic topicname")
    .!!.trim.split(":").last.toInt
}
val step = 1000
for (i <- 0 until maxOffsetValue by step) {
  val df: DataFrame = {
    spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topicname")
      .option("startingOffsets", s"""{"topicname":{"0":${i}}}""")
      .option("endingOffsets", s"""{"topicname":{"0":${i+step}}}""")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .as[(String, String)]
      .select(from_json(col("value"), dataSchema) as "data")
  }
  println(s"i: ${i}, i+step: ${i+step}, count: ${df.count()}")
}
However, the JSON format for startingOffsets and endingOffsets does not seem flexible: apparently an offset has to be specified for every partition, e.g. something like {"topicname":{"0":${i}, "1":${i}}} if there are two partitions.
My questions are:
Is there a better way to achieve the same results, possibly that can be extended directly to a multi partition topic?
Is there a way to read the maximum offset without using a shell command?
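As a possible direction for the multi-partition case, the per-partition JSON can be generated rather than written by hand. The sketch below is plain Scala with no Spark or Kafka dependency. It takes the earliest and latest offset of each partition as given inputs (how they are obtained is outside this sketch) and produces one (startingOffsets, endingOffsets) pair per batch, clamping each partition at its latest offset. OffsetBatches, offsetsJson and batches are hypothetical names.

```scala
object OffsetBatches {
  // Render {"topic":{"0":10,"1":20}}, the shape used by startingOffsets/endingOffsets.
  def offsetsJson(topic: String, offsets: Map[Int, Long]): String = {
    val body = offsets.toSeq.sortBy(_._1)
      .map { case (partition, offset) => "\"" + partition + "\":" + offset }
      .mkString(",")
    "{\"" + topic + "\":{" + body + "}}"
  }

  // One (start, end) JSON pair per batch; every partition advances by `step`
  // per batch and stops at its own latest offset.
  def batches(topic: String,
              earliest: Map[Int, Long],
              latest: Map[Int, Long],
              step: Long): Seq[(String, String)] = {
    require(step > 0, "step must be positive")
    def from(p: Int): Long = earliest.getOrElse(p, 0L)
    val maxSpan = latest.map { case (p, end) => end - from(p) }.foldLeft(0L)(_ max _)
    val count = ((maxSpan + step - 1) / step).toInt
    (0 until count).map { k =>
      val start = latest.map { case (p, end) => p -> math.min(from(p) + k * step, end) }
      val stop = latest.map { case (p, end) => p -> math.min(from(p) + (k + 1) * step, end) }
      (offsetsJson(topic, start), offsetsJson(topic, stop))
    }
  }
}
```

Each pair could then be passed to .option("startingOffsets", start) and .option("endingOffsets", end) in the loop from the question. A partition that is already exhausted gets the same start and end offset in later batches.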

Spark Structured Streaming - AssertionError in Checkpoint due to increasing the number of input sources

I am trying to join two streams into one and write the result to a topic
1- Reading two topics
val PERSONINFORMATION_df: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "xx:9092")
  .option("subscribe", "PERSONINFORMATION")
  .option("group.id", "info")
  .option("maxOffsetsPerTrigger", 1000)
  .option("startingOffsets", "earliest")
  .load()
val CANDIDATEINFORMATION_df: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "xxx:9092")
  .option("subscribe", "CANDIDATEINFORMATION")
  .option("group.id", "candent")
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", 1000)
  .option("failOnDataLoss", "false")
  .load()
2- Parse data to join them:
val parsed_PERSONINFORMATION_df: DataFrame = PERSONINFORMATION_df
  .select(from_json(expr("cast(value as string) as actualValue"), schemaPERSONINFORMATION).as("s")).select("s.*")
val parsed_CANDIDATEINFORMATION_df: DataFrame = CANDIDATEINFORMATION_df
  .select(from_json(expr("cast(value as string) as actualValue"), schemaCANDIDATEINFORMATION).as("s")).select("s.*")
val df_person = parsed_PERSONINFORMATION_df.as("dfperson")
val df_candidate = parsed_CANDIDATEINFORMATION_df.as("dfcandidate")
3- Join two frames
val joined_df : DataFrame = df_candidate.join(df_person, col("dfcandidate.PERSONID") === col("dfperson.ID"),"inner")
val string2json: DataFrame = joined_df.select($"dfcandidate.ID".as("key"),to_json(struct($"dfcandidate.ID", $"FULLNAME", $"PERSONALID")).cast("String").as("value"))
4- Write them to a topic
string2json.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "xxxx:9092")
  .option("topic", "toDelete")
  .option("checkpointLocation", "checkpoints")
  .option("failOnDataLoss", "false")
  .start()
Error message:
21/01/25 11:01:41 ERROR streaming.MicroBatchExecution: Query [id = 9ce8bcf2-0299-42d5-9b5e-534af8d689e3, runId = 0c0919c6-f49e-48ae-a635-2e95e31fdd50] terminated with error
java.lang.AssertionError: assertion failed: There are [1] sources in the checkpoint offsets and now there are [2] sources requested by the query. Cannot continue.

Your code looks fine to me; it is the checkpointing that is causing the issue.
Based on the error message, you probably ran this job at first with only one stream source. Then you added the code for the stream join and restarted the application without removing the existing checkpoint files. Now the application tries to recover from the checkpoint files but finds that they describe one source while the query requests two.
The section Recovery Semantics after Changes in a Streaming Query explains which changes are allowed and not allowed when using checkpointing. Changing the number of input sources is not allowed:
"Changes in the number or type (i.e. different source) of input sources: This is not allowed."
To solve your problem: Delete the current checkpoint files and re-start the job.

How to convert spark streaming Dataset[String] to DataFrame[Row]

I have Kafka messages in a non-standard format,
so the code looks as follows:
val df: Dataset[String] = spark
  .readStream
  .format("kafka")
  .option("subscribe", topic)
  .load()
  .map { v =>
    val e = MyAvroSchema.decodeEnvelope(v)
    val d = MyAvroSchema.decodeDatum(e)
    d
  }
At this point d is a string that represents a CSV line.
Assuming that I can create a csvSchema: StructType,
how can I convert it to a DataFrame[Row] with csvSchema?
One complication is that the schema is big (about 85 columns), so creating a case class or a tuple is not really an option.

Split dataset based on column value

I have a Dataset<Row> that is the result of a Kafka readStream, as shown in the Java snippet below.
m_oKafkaEvents = getSparkSession().readStream().format("kafka")
    .option("kafka.bootstrap.servers", strKafkaAddress)
    .option("subscribe", getInsightEvent().getTopic())
    .option("maxOffsetsPerTrigger", "100000")
    .option("startingOffsets", "latest")
    .option("failOnDataLoss", false)
    .load()
    .select(functions.from_json(functions.col("value").cast("string"), oSchema).as("events"));
I need to split this dataset on the column "Model", which would result in two Datasets (one for Opportunity_1 and one for Opportunity_2).
These Datasets would be published to a Kafka sink. The topic name would be the Model value, i.e. Opportunity_1 and Opportunity_2.
Hence I need a handle on each "Model" value and its respective list of events.
Since I am new to Spark, I am looking for help on how this can be achieved in Java.
I would appreciate any help.

The simplest solution would look like:
allEvents.selectExpr("topic", "CONCAT('m_oKafkaEvents_for_', Model, '_topic')")
    .write().format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .save();
You can see an example here: https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html#writing-the-output-of-batch-queries-to-kafka. But after looking at Spark's code, it seems that we can have only one topic per write, i.e. it will choose as the topic the one from the first encountered row:
def write(
    sparkSession: SparkSession,
    queryExecution: QueryExecution,
    kafkaParameters: ju.Map[String, Object],
    topic: Option[String] = None): Unit = {
  val schema = queryExecution.analyzed.output
  validateQuery(schema, kafkaParameters, topic)
  queryExecution.toRdd.foreachPartition { iter =>
    val writeTask = new KafkaWriteTask(kafkaParameters, schema, topic)
    Utils.tryWithSafeFinally(block = writeTask.execute(iter))(
      finallyBlock = writeTask.close())
  }
}
You can still try this approach and report back here whether it behaves as described above. If it doesn't work, there are alternative solutions:
Cache main DataFrame and create 2 other DataFrames, filtered by Model attribute
Use foreachPartition and Kafka writer to send the messages without splitting the main dataset
The first solution is pretty easy to implement, and it uses only Spark's own facilities. On the other hand, at least theoretically, splitting the dataset should be slightly slower than the second proposal. But measure before choosing one option or the other: the difference may be really small, and it is always better to use a clear, community-approved approach.
Below you can find some code showing both situations:
SparkSession spark = SparkSession.builder().getOrCreate();
Dataset<Row> allEvents = spark.readStream().format("kafka")
    .option("kafka.bootstrap.servers", "")
    .option("subscribe", "event")
    .option("maxOffsetsPerTrigger", "100000")
    .option("startingOffsets", "latest")
    .option("failOnDataLoss", false)
    .load()
    // oSchema: the schema from the question (the original snippet passed null as a placeholder)
    .select(functions.from_json(functions.col("value").cast("string"), oSchema).as("events"));

// First solution
Dataset<Row> opportunity1Events = allEvents.filter("Model = 'Opportunity_1'");
opportunity1Events.write().format("kafka").option("kafka.bootstrap.servers", "")
    .option("topic", "m_oKafkaEvents_for_Opportunity_1_topic").save();
Dataset<Row> opportunity2Events = allEvents.filter("Model = 'Opportunity_2'");
opportunity2Events.write().format("kafka").option("kafka.bootstrap.servers", "")
    .option("topic", "m_oKafkaEvents_for_Opportunity_2_topic").save();
// Note: Kafka writer was added in 2.2.0 https://github.com/apache/spark/commit/b0a5cd89097c563e9949d8cfcf84d18b03b8d24c
// Another approach: iterate over the messages accumulated within each partition
allEvents.foreachPartition(new ForeachPartitionFunction<Row>() {
    // Placeholder configuration: fill in bootstrap.servers and the serializers
    private KafkaProducer<String, Row> localProducer = new KafkaProducer<>(new HashMap<>());
    private final Map<String, String> modelsToTopics = new HashMap<>();
    {
        modelsToTopics.put("Opportunity_1", "m_oKafkaEvents_for_Opportunity_1_topic");
        modelsToTopics.put("Opportunity_2", "m_oKafkaEvents_for_Opportunity_2_topic");
    }

    @Override
    public void call(Iterator<Row> rows) throws Exception {
        // If your message is Opportunity1 => it goes to the Opportunity_1 topic,
        // otherwise it goes to the Opportunity_2 topic
        while (rows.hasNext()) {
            Row currentRow = rows.next();
            // you can reformat your row here or directly in Spark's map transformation
            localProducer.send(new ProducerRecord<>(modelsToTopics.get(currentRow.getAs("Model")),
                "some_message_key", currentRow));
        }
        // KafkaProducer accumulates messages in an in-memory buffer and sends them when a threshold is reached.
        // Flush them synchronously here to be sure that every stored message was correctly delivered.
        localProducer.flush();
        // You can also play with features added in Kafka 0.11: the idempotent producer and the transactional producer
    }
});
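
Both variants above rely on the same routing rule: the destination topic is derived from the Model value. A minimal sketch of that rule in plain Scala, with no Spark or Kafka involved, may make it easier to check in isolation. Event, topicFor and route are hypothetical names, and the topic naming simply mirrors the CONCAT expression used in this answer.

```scala
object ModelRouting {
  // Hypothetical event shape: the Model value plus an opaque payload.
  final case class Event(model: String, payload: String)

  // Same naming rule as the CONCAT expression in the answer.
  def topicFor(model: String): String = "m_oKafkaEvents_for_" + model + "_topic"

  // Group payloads by destination topic, keeping arrival order within each topic.
  def route(events: Seq[Event]): Map[String, Seq[String]] =
    events.groupBy(e => topicFor(e.model))
      .map { case (topic, es) => topic -> es.map(_.payload) }
}
```

Unlike the fixed two-entry map in the foreachPartition variant, deriving the topic from the value needs no change when a new Model value appears, at the cost of producing to topics you may not have created yet.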
