How can I write a Dataframe into Kafka? - apache-spark

My Dataframe df looks like
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
I creating a Kafka Sink for streaming queries, but I received nothing from kafka. why?
ds = df \
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
.option("topic", "topic1") \
.start()

You will not receive anything from Kafka because based on your code you are trying to select the columns key and value from a Dataframe which has only columns age and name. You need to select those as shown below.
Also, you do not need writeStream if your Dataframe is static. In that case you need to apply write and save.
import org.apache.spark.sql.functions.{col, struct, to_json}
import org.apache.spark.sql.SparkSession
object Main extends App {
val spark = SparkSession.builder()
.appName("myAppName")
.master("local[*]")
.getOrCreate()
// create DataFrame
import spark.implicits._
val df = Seq((3, "Alice"), (5, "Bob")).toDF("age", "name")
df.show(false)
// +---+-----+
// |age|name |
// +---+-----+
// |3 |Alice|
// |5 |Bob |
// +---+-----+
// write to Kafka as is with "age" as key and "name" as value
df.selectExpr("CAST(age AS STRING) as key", "CAST(name AS STRING) as value")
.write
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "test-topic")
.save()
If you want to store your data into a json string you can apply the following"
// convert columns into json string
val df2 = df.select(col("name"),to_json(struct($"*"))).toDF("key", "value")
df2.show(false)
// +-----+------------------------+
// |key |value |
// +-----+------------------------+
// |Alice|{"age":3,"name":"Alice"}|
// |Bob |{"age":5,"name":"Bob"} |
// +-----+------------------------+

Related

Can't deserialize avro file in spark

There is some problem trying to deserialize data from .avro file. My process consists of these steps:
reading from Kafka
df = (
spark.read.format("kafka")
.option("kafka.security.protocol", "PLAINTEXT")
.option("kafka.sasl.mechanism", "GSSAPI")
.option("kafka.bootstrap.servers", KAFKA_BROKERS)
.option("subscribe", KAKFA_TOPIC)
.option("group.id", "noid")
.option("startingOffsets", "earliest")
.option("failOnDataLoss", "false")
.option("maxOffsetsPerTrigger", 2)
.load()
.selectExpr(
"CAST(value AS binary)"
)
)
Writing to hdfs
df = df.select("value")
df.write.format("avro").save("path/to/avro_files")
Loading .avro file
schema = """{
"type":"record",
"name":"GPRSIOT",
"fields":[
{
"name":"field_1",
"type":"string",
"default":"null"
}
...
]
}
dff= spark.read.format("avro") \
.option("avroSchema", schema) \
.load("path/to/avro_files/part-00000-ff1186a6-dce4-492e-87b8-8bb4332f4bc1-c000.avro")
The problem is that I get nulls as values.
+-------+-------+-------+
|field_1|field_2|field_3|
+-------+-------+-------+
|null |0 |null |
+-------+-------+-------+
A small part of my .avro file:
Ɖ�v|y
f�30387f��������������������������������������Z�19807F� ؖ��1.2�A0Z��"�b
0FDDZv�
169r&Z��~�~�~�~�~�~�~�~�~�~�~�~�~�~�~�~�~�~�~Z~32210F�+��ߘv~y�1.1Z}*�L2FF3E�N�
�~841576��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+05321F�
Any ideas on how to deserialize an avro file with real values instead of nulls?

Unable to write results to kafka topic using spark

My end goal is to write out and read the aggregated data to the new Kafka topic in the batches it gets processed. I followed the official documentation and a couple of other posts but no luck. I would first read the topic, perform aggregation, save the results in another Kafka topic, and again read the topic and print it in the console. Below is my code:
package com.sparkKafka
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming._
import scala.concurrent.duration._
object SparkKafkaTopic3 {
def main(ar: Array[String]) {
val spark = SparkSession.builder().appName("SparkKafka").master("local[*]").getOrCreate()
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "songDemo5")
.option("startingOffsets", "earliest")
.load()
import spark.implicits._
df.printSchema()
val newDf = df.select($"value".cast("string"), $"timestamp").select(split(col("value"), ",")(0).as("userName"), split(col("value"), ",")(1).as("songName"), col("timestamp"))
val windowedCount = newDf
.withWatermark("timestamp", "40000 milliseconds")
.groupBy(
window(col("timestamp"), "20 seconds"), col("songName"))
.agg(count(col("songName")).alias("numberOfTimes"))
val outputTopic = windowedCount
.select(struct("*").cast("string").as("value")) // Added this line.
.writeStream
.format("kafka")
.option("topic", "songDemo6")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "/tmp/spark_ss/")
.start()
val finalOutput = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "songDemo6").option("startingOffsets", "earliest")
.load()
.writeStream.format("console")
.outputMode("append").start()
spark.streams.awaitAnyTermination()
}
}
When I run this, in the console initially there is a below exception
java.lang.IllegalStateException: Cannot find earliest offsets of Set(songDemo4-0). Some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the
source option "failOnDataLoss" to "false".
Also, if I try to run this code without writing to the topic part and reading it again everything works fine.
I tried to read the topic from the shell using consumer command but no records are displayed. Is there anything that I am missing over here?
Below is my dataset:
>sid,Believer
>sid,Thunder
>sid,Stairway to heaven
>sid,Heaven
>sid,Heaven
>sid,thunder
>sid,Believer
When I ran #Srinivas's code and after reading the new topic I am getting data as below:
[[2020-06-07 18:18:40, 2020-06-07 18:19:00], Heaven, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Believer, 1]
[[2020-06-07 18:18:40, 2020-06-07 18:19:00], Heaven, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Believer, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Stairway to heaven, 1]
[[2020-06-07 18:40:40, 2020-06-07 18:41:00], Heaven, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Thunder, 1]
Here you can see for Believer the window frame is the same but still, the entries are separate. Why is it so? It should be single entry with count 2 since the window frame is the same
Check below code.
Added this windowedCount.select(struct("*").cast("string").as("value")) before you write anything to kafka you have to convert all columns of type string alias of that column is value
val spark = SparkSession.builder().appName("SparkKafka").master("local[*]").getOrCreate()
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "songDemo")
.option("startingOffsets", "earliest")
.load()
import spark.implicits._
df.printSchema()
val newDf = df.select($"value".cast("string"),$"timestamp").select(split(col("value"), ",")(0).as("userName"), split(col("value"), ",")(1).as("songName"), col("timestamp"))
val windowedCount = newDf
.withWatermark("timestamp", "40000 milliseconds")
.groupBy(
window(col("timestamp"), "20 seconds"), col("songName"))
.agg(count(col("songName")).alias("numberOfTimes"))
val outputTopic = windowedCount
.select(struct("*").cast("string").as("value")) // Added this line.
.writeStream
.format("kafka")
.option("topic", "songDemoA")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "/tmp/spark_ss/")
.start()
val finalOutput = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "songDemoA").option("startingOffsets", "earliest")
.load()
.writeStream.format("console")
.outputMode("append").start()
spark.streams.awaitAnyTermination()
Updated - Ordering Output
val windowedCount = newDf
.withWatermark("timestamp", "40000 milliseconds")
.groupBy(
window(col("timestamp"), "20 seconds"), col("songName"))
.agg(count(col("songName")).alias("numberOfTimes"))
.orderBy($"window.start".asc) // Add this line if you want order.
Ordering or sorting result works only if you use output mode is complete for any other values it will throw an error.
For example check below code.
val outputTopic = windowedCount
.writeStream
.format("console")
.option("truncate","false")
.outputMode("complete")
.start()

How to get Avro data from Confluent Schema Registry in String format from kafka in pyspark?

I am reading data from Kafka in spark(structured streaming) But Data getting in spark from kafka in spark is not in string format.
Spark: 2.3.4
Kafka Data format:
{"Patient_ID":316,"Name":"Richa","MobileNo":{"long":7049123177},"BDate":{"int":740},"Gender":"female"}
Here is the code for kafka to spark structured streaming:
# spark-submit --jars kafka-clients-0.10.0.1.jar --packages org.apache.spark:spark-avro_2.11:2.4.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0,org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.3.4,org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 /home/kinjalpatel/kafka_sppark.py
import pyspark
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
import json
from pyspark.sql.functions import from_json, col, struct
from pyspark.sql.types import StructField, StructType, StringType, DoubleType
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer
from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from pyspark.sql.column import Column, _to_java_column
sc = SparkContext()
sc.setLogLevel("ERROR")
spark = SparkSession(sc)
schema_registry_client = CachedSchemaRegistryClient(
url='http://localhost:8081')
serializer = MessageSerializer(schema_registry_client)
df = spark.readStream.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "mysql-01-Patient") \
.option("partition.assignment.strategy", "range") \
.option("valueConverter", "org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter") \
.load()
df.printSchema()
mta_stream=df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "CAST(topic AS STRING)", "CAST(partition AS STRING)", "CAST(offset AS STRING)", "CAST(timestamp AS STRING)", "CAST(timestampType AS STRING)")
mta_stream.printSchema()
qry = mta_stream.writeStream.outputMode("append").format("console").start()
qry.awaitTermination()
This is the output I get:
+----+--------------------+----------------+---------+------+--------------------+-------------+
| key| value| topic|partition|offset| timestamp|timestampType|
+----+--------------------+----------------+---------+------+--------------------+-------------+
|null|�
Richa���...|mysql-01-Patient| 0| 160|2019-12-27 11:56:...| 0|
+----+--------------------+----------------+---------+------+--------------------+-------------+
How to get value column in string format?
from Spark documentation
import org.apache.spark.sql.avro._
// `from_avro` requires Avro schema in JSON string format.
val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc" )))
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
val output = df
.select(from_avro('value, jsonFormatSchema) as 'user)
.where("user.favorite_color == \"red\"")
.select(to_avro($"user.name") as 'value)
val query = output
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic2")
.start()
from databricks documentation
import org.apache.spark.sql.avro._
import org.apache.avro.SchemaBuilder
// When reading the key and value of a Kafka topic, decode the
// binary (Avro) data into structured data.
// The schema of the resulting DataFrame is: <key: string, value: int>
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("subscribe", "t")
.load()
.select(
from_avro($"key", SchemaBuilder.builder().stringType()).as("key"),
from_avro($"value", SchemaBuilder.builder().intType()).as("value"))
For Reading Avro message from Kafka topic and parsing in pyspark structured streaming, don't have direct libraries for the same . But we can read/parsing Avro message by writing small wrapper and call that function as UDF in your pyspark streaming code.
Please refer:
Reading avro messages from Kafka in spark streaming/structured streaming

Structured streaming truncates kafka timestamps to seconds

I am reading data from Kafka with Spark Structured Streaming and want to include the Kafka timestamp in the message:
sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka-broker:10000")
.option("subscribe", "topicname")
.option("includeTimestamp", true)
.load()
.selectExpr("CAST(topic AS STRING)", "CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
.as[(String, String, String, Long)]
When I then look at the timestamp it is truncated from milliseconds to seconds. Is there any way I can get the millisecond precision back after reading?
The truncation happens when the timestamps is read as a Long value. This happens in the last line of:
sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka-broker:10000")
.option("subscribe", "topicname")
.option("includeTimestamp", true)
.load()
.selectExpr("CAST(topic AS STRING)", "CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
.as[(String, String, String, Long)]
It does not truncate when you change the last line to:
.as[(String, String, String, Timestamp)]
I just tried this quickly in IntelliJ with my local Kafka set-up.
If you are referring to the three dots in the end of the timestamp field as truncation (as in the output below):
Batch: 1
-------------------------------------------
+-----+----+--------+--------------------+
|topic| key| value| timestamp|
+-----+----+--------+--------------------+
| test|null|test-123|2018-10-07 03:10:...|
| test|null|test-234|2018-10-07 03:10:...|
+-----+----+--------+--------------------+
Then you just need to add the following line:
.option("truncate", false)
in your writeStream() portion like:
Dataset<Row> df = sparkSession
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test")
.option("includeTimestamp", "true")
.load()
.selectExpr("CAST(topic AS STRING)", "CAST(key AS STRING)", "CAST(value AS STRING)", "CAST(timestamp as STRING)");
try {
df.writeStream()
.outputMode("append")
.format("console")
.option("truncate", false)
.start()
.awaitTermination();
} catch (StreamingQueryException e) {
e.printStackTrace();
}
This change gave me the full timestamp in the output:
Batch: 1
-------------------------------------------
+-----+----+--------+-----------------------+
|topic|key |value |timestamp |
+-----+----+--------+-----------------------+
|test |null|test-123|2018-10-07 03:19:50.677|
|test |null|test-234|2018-10-07 03:19:52.673|
+-----+----+--------+-----------------------+
I hope this helps.

Spark (Scala) Structured Streaming Aggregation and Self Join

I'm trying to perform an aggregation followed by a self-join on a Structured Streaming DataFrame. Let's suppose the df looks like as follows:
sourceDf.show(false)
+-----+-------+
|owner|fruits |
+-----+-------+
|Brian|apple |
|Brian|pear |
|Brian|melon |
|Brian|avocado|
|Bob |avocado|
|Bob |apple |
+-----+-------+
On a static DataFrame, it is easy:
val aggDf = sourceDf.groupBy($"owner").agg(collect_list(col("fruits")) as "fruitsA")
sourceDf.join(aggDf, Seq("owner")).show(false)
+-----+-------+-----------------------------+
|owner|fruits |fruitsA |
+-----+-------+-----------------------------+
|Brian|apple |[apple, pear, melon, avocado]|
|Brian|pear |[apple, pear, melon, avocado]|
|Brian|melon |[apple, pear, melon, avocado]|
|Brian|avocado|[apple, pear, melon, avocado]|
|Bob |avocado|[avocado, apple] |
|Bob |apple |[avocado, apple] |
+-----+-------+-----------------------------+
Unfortunately, I'm unable to figure out how to do this in the case of a Streaming DataFrame. So, I tried using the following complete code that uses Kafka for both Source and Sink:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StringType, StructType}
object Test {
val spark: SparkSession = SparkSession.builder().getOrCreate()
import spark.implicits._
val brokers = "kafka:9092"
val inputTopic = "test.kafka.sink.input"
val aggTopic = "test.kafka.sink.agg"
val outputTopicSelf = "test.kafka.sink.output.self"
val outputTopicTwo = "test.kafka.sink.output.two"
val payloadSchema: StructType = new StructType()
.add("owner", StringType)
.add("fruits", StringType)
val payloadSchemaA: StructType = new StructType()
.add("owner", StringType)
.add("fruitsA", StringType)
var joinedDfSchema: StructType = _
val sourceDf: DataFrame = Seq(
("Brian", "apple"),
("Brian", "pear"),
("Brian", "melon"),
("Brian", "avocado"),
("Bob", "avocado"),
("Bob", "apple")
)
.toDF("owner", "fruits")
val additionalData: DataFrame = Seq(("Bob", "grapes")).toDF("owner", "fruits")
def saveDfToKafka(df: DataFrame): Unit = {
df
.select(to_json(struct(df.columns.map(column): _*)).alias("value"))
.write
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("topic", inputTopic)
.save()
}
// save data to kafka (batch)
saveDfToKafka(sourceDf)
// kafka source
val farmDF: DataFrame = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("startingOffsets", "earliest")
.option("subscribe", inputTopic)
.load()
.byteArrayToString("value")
.withColumn("value", from_json($"value", payloadSchema))
.expand("value")
farmDF.printSchema()
implicit class DFHelper(df: DataFrame) {
def expand(column: String): DataFrame = {
val wantedColumns = df.columns.filter(_ != column) :+ s"$column.*"
df.select(wantedColumns.map(col): _*)
}
def byteArrayToString(column: String): DataFrame = {
val selectedCols = df.columns.filter(_ != column) :+ s"CAST($column AS STRING)"
df.selectExpr(selectedCols: _*)
}
}
def testSelfAggJoinFail(): Unit = {
// aggregated df
val myFarmDF = farmDF
.groupBy($"owner")
.agg(collect_list(col("fruits")) as "fruitsA")
// joined df
val joinedDF = farmDF
.join(myFarmDF.as("myFarmDF"), Seq("owner"))
.select("owner", "fruits", "myFarmDF.fruitsA")
joinedDfSchema = joinedDF.schema
// stream sink
joinedDF
.select(to_json(struct(joinedDF.columns.map(column): _*)).alias("value"))
.writeStream
.outputMode("append")
.option("kafka.bootstrap.servers", brokers)
.option("checkpointLocation", "/data/kafka/checkpointSelf")
.option("topic", outputTopicSelf)
.format("kafka")
.start()
// let's give time to process the stream
Thread.sleep(10000)
}
def testSelfAggJoin(): Unit = {
// aggregated df
val myFarmDF = farmDF
.withWatermark("timestamp", "30 seconds")
.groupBy(
window($"timestamp", "30 seconds", "15 seconds"),
$"owner"
)
.agg(collect_list(col("fruits")) as "fruitsA")
.select("owner", "fruitsA", "window")
// joined df
val joinedDF = farmDF
.as("farmDF")
.withWatermark("timestamp", "30 seconds")
.join(
myFarmDF.as("myFarmDF"),
expr(
"""
|farmDF.owner = myFarmDF.owner AND
|farmDF.timestamp >= myFarmDF.window.start AND
|farmDF.timestamp <= myFarmDF.window.end
""".stripMargin))
.select("farmDF.owner", "farmDF.fruits", "myFarmDF.fruitsA")
joinedDfSchema = joinedDF.schema
// stream sink
joinedDF
.select(to_json(struct(joinedDF.columns.map(column): _*)).alias("value"))
.writeStream
.outputMode("append")
.option("kafka.bootstrap.servers", brokers)
.option("checkpointLocation", "/data/kafka/checkpointSelf")
.option("topic", outputTopicSelf)
.format("kafka")
.start()
// let's give time to process the stream
Thread.sleep(10000)
}
def testTwoDfAggJoin(): Unit = {
// aggregated df
val myFarmDF = farmDF
.withWatermark("timestamp", "30 seconds")
.groupBy(
$"owner"
)
.agg(collect_list(col("fruits")) as "fruitsA")
.select("owner", "fruitsA")
// save the aggregated df to kafka
myFarmDF
.select(to_json(struct(myFarmDF.columns.map(column):_*)).alias("value"))
.writeStream
.outputMode("update")
.option("kafka.bootstrap.servers", brokers)
.option("checkpointLocation", "/data/kafka/checkpointAgg")
.option("topic", aggTopic)
.format("kafka")
.start()
// let's give time to process the stream
Thread.sleep(10000)
// read the aggregated df from kafka as a stream
val aggDF = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("startingOffsets", "earliest")
.option("subscribe", aggTopic)
.load()
.byteArrayToString("value")
.withColumn("value", from_json($"value", payloadSchemaA))
.expand("value")
.withWatermark("timestamp", "30 seconds")
// joined df
val joinedDF = farmDF
.as("farmDF")
.join(
aggDF.as("myFarmDF"),
expr(
"""
|farmDF.owner = myFarmDF.owner AND
|farmDF.timestamp >= myFarmDF.timestamp - interval 1 hour AND
|farmDF.timestamp <= myFarmDF.timestamp + interval 1 hour
""".stripMargin))
.select("farmDF.owner", "myFarmDF.fruitsA", "farmDF.fruits")
joinedDfSchema = joinedDF.schema
// stream sink
joinedDF
.select(to_json(struct(joinedDF.columns.map(column):_*)).alias("value"))
.writeStream
.outputMode("append")
.option("kafka.bootstrap.servers", brokers)
.option("checkpointLocation", "/data/kafka/checkpointTwo")
.option("topic", outputTopicTwo)
.format("kafka")
.start()
// let's give time to process the stream
Thread.sleep(10000)
}
def data(topic: String): DataFrame = {
// let's read back the output topic using kafka batch
spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", topic)
.load()
.byteArrayToString("value")
.withColumn("value", from_json($"value", joinedDfSchema))
.expand("value")
}
}
Now, if I test on a Streaming DataFrame:
scala> Test.testSelfAggJoinFail
org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
Project [structstojson(named_struct(owner, owner#59, fruits, fruits#60, fruitsA, fruitsA#78), Some(Etc/UTC)) AS value#96]
+- Project [owner#59, fruits#60, fruitsA#78]
+- Project [owner#59, key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, fruits#60, fruitsA#78]
+- Join Inner, (owner#59 = owner#82)
:- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, value#51.owner AS owner#59, value#51.fruits AS fruits#60]
: +- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, jsontostructs(StructField(owner,StringType,true), StructField(fruits,StringType,true), value#43, Some(Etc/UTC), true) AS value#51]
: +- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, cast(value#30 as string) AS value#43]
: +- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider#3269e790, kafka, Map(startingOffsets -> earliest, subscribe -> test.kafka.sink.input, kafka.bootstrap.servers -> kafka:9092), [key#29, value#30, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35], StreamingRelation DataSource(org.apache.spark.sql.SparkSession#42eeb996,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> test.kafka.sink.input, kafka.bootstrap.servers -> kafka:9092),None), kafka, [key#22, value#23, topic#24, partition#25, offset#26L, timestamp#27, timestampType#28]
+- SubqueryAlias myFarmDF
+- Aggregate [owner#82], [owner#82, collect_list(fruits#83, 0, 0) AS fruitsA#78]
+- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, value#51.owner AS owner#82, value#51.fruits AS fruits#83]
+- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, jsontostructs(StructField(owner,StringType,true), StructField(fruits,StringType,true), value#43, Some(Etc/UTC), true) AS value#51]
+- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, cast(value#30 as string) AS value#43]
+- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider#3269e790, kafka, Map(startingOffsets -> earliest, subscribe -> test.kafka.sink.input, kafka.bootstrap.servers -> kafka:9092), [key#29, value#30, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35], StreamingRelation DataSource(org.apache.spark.sql.SparkSession#42eeb996,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> test.kafka.sink.input, kafka.bootstrap.servers -> kafka:9092),None), kafka, [key#22, value#23, topic#24, partition#25, offset#26L, timestamp#27, timestampType#28]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:110)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:235)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:299)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:296)
at Test$.testSelfAggJoinFail(<console>:123)
... 51 elided
it fails with Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark because I don't use watermarks.
Now, if I can run the second test with
Test.testSelfAggJoin
I get these warnings
2018-09-12 16:07:33 WARN StreamingJoinHelper:66 - Failed to extract state value watermark from condition (window#70-T30000ms.start - timestamp#139-T30000ms) due to window#70-T30000ms.start
2018-09-12 16:07:33 WARN StreamingJoinHelper:66 - Failed to extract state value watermark from condition (timestamp#139-T30000ms - window#70-T30000ms.end) due to window#70-T30000ms.end
2018-09-12 16:07:33 WARN StreamingJoinHelper:66 - Failed to extract state value watermark from condition (window#70-T30000ms.start - timestamp#139-T30000ms) due to window#70-T30000ms.start
2018-09-12 16:07:33 WARN StreamingJoinHelper:66 - Failed to extract state value watermark from condition (timestamp#139-T30000ms - window#70-T30000ms.end) due to window#70-T30000ms.end
And I can check the result with
Test.data(Test.outputTopicSelf).show(false)
2018-09-12 16:08:01 WARN NetworkClient:882 - [Consumer clientId=consumer-5, groupId=spark-kafka-relation-02f5512f-cc3c-40ad-938f-e3dfdca95f8c-driver-0] Error while fetching metadata with correlation id 2 : {test.kafka.sink
.output.self=LEADER_NOT_AVAILABLE}
2018-09-12 16:08:01 WARN NetworkClient:882 - [Consumer clientId=consumer-5, groupId=spark-kafka-relation-02f5512f-cc3c-40ad-938f-e3dfdca95f8c-driver-0] Error while fetching metadata with correlation id 6 : {test.kafka.sink
.output.self=LEADER_NOT_AVAILABLE}
+---+-----+---------+------+---------+-------------+-----+------+-------+
|key|topic|partition|offset|timestamp|timestampType|owner|fruits|fruitsA|
+---+-----+---------+------+---------+-------------+-----+------+-------+
+---+-----+---------+------+---------+-------------+-----+------+-------+
which returns an empty DataFrame (probably because of the warning?).
I was unable to find a solution with a self-join.
Finally I tried by sinking the aggregation to Kafka and re-reading it as a second Streaming DataFrame, as in
scala> Test.data(Test.outputTopicTwo).show(false)
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+
|key |topic |partition|offset|timestamp |timestampType|owner|fruitsA |fruits |
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+
|null|test.kafka.sink.output.two|0 |0 |2018-09-12 16:57:04.376|0 |Brian|["avocado","apple","pear","melon"]|avocado|
|null|test.kafka.sink.output.two|0 |1 |2018-09-12 16:57:04.376|0 |Bob |["apple","avocado"] |apple |
|null|test.kafka.sink.output.two|0 |2 |2018-09-12 16:57:04.38 |0 |Brian|["avocado","apple","pear","melon"]|apple |
|null|test.kafka.sink.output.two|0 |3 |2018-09-12 16:57:04.38 |0 |Bob |["apple","avocado"] |avocado|
|null|test.kafka.sink.output.two|0 |4 |2018-09-12 16:57:04.381|0 |Brian|["avocado","apple","pear","melon"]|pear |
|null|test.kafka.sink.output.two|0 |5 |2018-09-12 16:57:04.382|0 |Brian|["avocado","apple","pear","melon"]|melon |
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+
which works (although not very efficiently, I'd say) but if I add additional data to the source topic:
scala> Test.saveDfToKafka(Test.additionalData)
scala> Test.data(Test.outputTopicTwo).show(false)
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+
|key |topic |partition|offset|timestamp |timestampType|owner|fruitsA |fruits |
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+
|null|test.kafka.sink.output.two|0 |0 |2018-09-12 16:57:04.376|0 |Brian|["avocado","apple","pear","melon"]|avocado|
|null|test.kafka.sink.output.two|0 |1 |2018-09-12 16:57:04.376|0 |Bob |["apple","avocado"] |apple |
|null|test.kafka.sink.output.two|0 |2 |2018-09-12 16:57:04.38 |0 |Brian|["avocado","apple","pear","melon"]|apple |
|null|test.kafka.sink.output.two|0 |3 |2018-09-12 16:57:04.38 |0 |Bob |["apple","avocado"] |avocado|
|null|test.kafka.sink.output.two|0 |4 |2018-09-12 16:57:04.381|0 |Brian|["avocado","apple","pear","melon"]|pear |
|null|test.kafka.sink.output.two|0 |5 |2018-09-12 16:57:04.382|0 |Brian|["avocado","apple","pear","melon"]|melon |
|null|test.kafka.sink.output.two|0 |6 |2018-09-12 16:59:37.125|0 |Bob |["apple","avocado"] |grapes |
|null|test.kafka.sink.output.two|0 |7 |2018-09-12 16:59:40.001|0 |Bob |["apple","avocado","grapes"] |apple |
|null|test.kafka.sink.output.two|0 |8 |2018-09-12 16:59:40.002|0 |Bob |["apple","avocado","grapes"] |avocado|
|null|test.kafka.sink.output.two|0 |9 |2018-09-12 16:59:40.002|0 |Bob |["apple","avocado","grapes"] |grapes |
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+
I get many more rows, probably because I had to use .outputMode("update") while sinking the aggregation Df.
Is there a way to perform this aggregation without sending the aggregation back to Kafka as a separate topic?
If not, is it possible to modify testTwoDfAggJoin to use .outputMode("append")?
As of Spark 2.3, Join of two streaming DF is not possible when there are some aggregate functions involved before join.
From the spark documentation
Additional details on supported joins:
Joins can be cascaded, that is, you can do df1.join(df2, ...).join(df3, ...).join(df4, ....).
As of Spark 2.3, you can use joins only when the query is in Append output mode. Other output modes are not yet supported.
As of Spark 2.3, you cannot use other non-map-like operations before joins. Here are a few examples of what cannot be used.
Cannot use streaming aggregations before joins.
Cannot use mapGroupsWithState and flatMapGroupsWithState in Update mode before joins.
I encountered the similar error info, outputMode is important to agg, I solved by adding df.writeStream.outputMode(OutputMode.Update()) or df.writeStream.outputMode(OutputMode.Complete())
Ref:
Output Modes There are a few types of output modes.
Append mode (default) - This is the default mode, where only the new
rows added to the Result Table since the last trigger will be
outputted to the sink. This is supported for only those queries where
rows added to the Result Table is never going to change. Hence, this
mode guarantees that each row will be output only once (assuming
fault-tolerant sink). For example, queries with only select, where,
map, flatMap, filter, join, etc. will support Append mode.
Complete mode - The whole Result Table will be outputted to the sink
after every trigger. This is supported for aggregation queries.
Update mode - (Available since Spark 2.1.1) Only the rows in the
Result Table that were updated since the last trigger will be
outputted to the sink. More information to be added in future
releases.
http://blog.madhukaraphatak.com/introduction-to-spark-structured-streaming-part-3/

Resources