Structured Streaming: Reading from multiple Kafka topics at once - apache-spark

I have a Spark Structured Streaming Application which has to read from 12 Kafka topics (Different Schemas, Avro format) at once, deserialize the data and store in HDFS. When I read from a single topic using my code, it works fine and without errors but on running multiple queries together, I'm getting the following error
java.lang.IllegalStateException: Race while writing batch 0
My code is as follows:
def main(args: Array[String]): Unit = {
val kafkaProps = Util.loadProperties(kafkaConfigFile).asScala
val topic_list = ("topic1", "topic2", "topic3", "topic4")
topic_list.foreach(x => {
kafkaProps.update("subscribe", x)
val source= Source.fromInputStream(Util.getInputStream("/schema/topics/" + x)).getLines.mkString
val schemaParser = new Schema.Parser
val schema = schemaParser.parse(source)
val sqlTypeSchema = SchemaConverters.toSqlType(schema).dataType.asInstanceOf[StructType]
val kafkaStreamData = spark
.readStream
.format("kafka")
.options(kafkaProps)
.load()
val udfDeserialize = udf(deserialize(source), DataTypes.createStructType(sqlTypeSchema.fields))
val transformedDeserializedData = kafkaStreamData.select("value").as(Encoders.BINARY)
.withColumn("rows", udfDeserialize(col("value")))
.select("rows.*")
val query = transformedDeserializedData
.writeStream
.trigger(Trigger.ProcessingTime("5 seconds"))
.outputMode("append")
.format("parquet")
.option("path", "/output/topics/" + x)
.option("checkpointLocation", checkpointLocation + "//" + x)
.start()
})
spark.streams.awaitAnyTermination()
}

Alternative. You can use KAFKA Connect (from Confluent), NIFI, StreamSets etc. as your use case seems to fit "dump/persist to HDFS". That said, you need to have these tools (installed). The small files problem you state is not an issue, so be it.
From Apache Kafka 0.9 or later version you can Kafka Connect API for KAFKA --> HDFS Sink (various supported HDFS formats). You need a KAFKA Connect Cluster though, but that is based on your existing Cluster in any event, so not a big deal. But someone needs to maintain.
Some links to get you on your way:
https://data-flair.training/blogs/kafka-connect/
https://github.com/confluentinc/kafka-connect-hdfs

Related

How to stream data from Delta Table to Kafka Topic

Internet is filled with examples of streaming data from Kafka topic to delta tables. But my requirement is to stream data from Delta Table to Kafka topic. Is that possible? If yes, can you please share code example?
Here is the code I tried.
val schemaRegistryAddr = "https://..."
val avroSchema = buildSchema(topic) //defined this method
val Df = spark.readStream.format("delta").load("path..")
.withColumn("key", col("lskey").cast(StringType))
.withColumn("topLevelRecord",struct(col("col1"),col("col2")...)
.select(
to_avro($"key", lit("topic-key"), schemaRegistryAddr).as("key"),
to_avro($"topLevelRecord", lit("topic-value"), schemaRegistryAddr, avroSchema).as("value"))
Df.writeStream
.format("kafka")
.option("checkpointLocation",checkpointPath)
.option("kafka.bootstrap.servers", bootstrapServers)
.option("kafka.security.protocol", "SSL")
.option("kafka.ssl.keystore.location", kafkaKeystoreLocation)
.option("kafka.ssl.keystore.password", keystorePassword)
.option("kafka.ssl.truststore.location", kafkaTruststoreLocation)
.option("topic",topic)
.option("batch.size",262144)
.option("linger.ms",5000)
.trigger(ProcessingTime("25 seconds"))
.start()
But it fails with: org.spark_project.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema not found; error code: 40403
But when I try to write to the same topic using a Batch Producer it goes through successfully. Can anyone please let me know what am I missing in the streaming write to Kafka topic?
Later I found this old blog which says that current Structured Streaming API does not support 'kakfa' format.
https://www.databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html?_ga=2.177174565.1658715673.1672876248-681971438.1669255333

How to calculate moving average in spark structured streaming?

I am trying to calculate a moving average in a spark structured streaming in terms of rows preceding and not time-event based.
Kafka has string messages like this:
device1#227.92#2021-08-19T12:15:13.540Z
and there is this code
Dataset<Row> lines = sparkSession.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "users")
.load()
.selectExpr("CAST(value AS STRING)")
.map((MapFunction<Row, Row>) row -> {
String message = row.getAs("value");
String[] newRow = message.split("#");
return RowFactory.create(newRow);
}, RowEncoder.apply(structType))
.selectExpr("CAST(item AS STRING)", "CAST(value AS DOUBLE)", "CAST(timestamp AS TIMESTAMP)");
The above code reads stream from kafka and transforms string messages to rows.
When i try to do sth like this:
WindowSpec threeRowWindow = Window.partitionBy("item").orderBy("timestamp").rowsBetween(Window.currentRow(), -3);
Dataset<Row> testWindow =
lines.withColumn("avg", functions.avg("value").over(threeRowWindow));
I get this error:
org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;
Is there any other way to calculate the moving average as every message is coming and updating it as new data comes from stream? Or any non time-based operation is by default not supported to spark structured streaming?
Thanks

how to determine where the Kafka consumption is made on Spark

We tried two MlLib transformers for Kafka consuming: one using Structured Batch Query Stream like here, and one using normal Kafka consumers. The normal consumers read into a list that gets converted to a dataframe
We created two MlLib pipelines starting with an empty dataframe. The first transformer of these pipes did the reading from Kafka. We had a pipe for each kafka transformer type.
then ran the pipes: 1st with normal kafka consumers, 2nd with Spark consumer, 3rd with normal consumers again.
The spark config had 4 executors, with 1 core each:
spark.executor.cores": "1", "spark.executor.instances": 4,
Questions are:
A. where is the consumption made? On the executors or on the driver? according to driver UI, it looks like in both cases the executors did all the work - the driver did not pass any data and 4 executors got created.
B. Why do we have a different number of executors running? In the 1st run, with normal consumers, we see 4 executors working; In the 2nd run, with spark Kafka connector, 1 executor; In 3rd run with normal consumers, 1 executor but 2 cores?
you`ll see the driver's UI attached at the bottom.
this is the relevant code:
Normal Consumer:
var kafkaConsumer: KafkaConsumer[String, String] = null
val readMessages = () => {
for (record <- records) {
recordList.append(record.value())
}
}
kafkaConsumer.subscribe(util.Arrays.asList($(topic)))
readMessages()
var df = recordList.toDF
kafkaConsumer.close()
val json_schema =
df.sparkSession.read.json(df.select("value").as[String]).schema
df = df.select(from_json(col("value"), json_schema).as("json"))
df = df.select(col("json.*"))
Spark consumer:
val records = dataset
.sparkSession
.read
.format("kafka")
.option("kafka.bootstrap.servers", $(url))
.option("subscribe", $(this.topic))
.option("kafkaConsumer.pollTimeoutMs", s"${$(timeoutMs)}")
.option("startingOffsets", $(startingOffsets))
.option("endingOffsets", $(endingOffsets))
.load
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
OmnixLogger.warn(uid, "Executor ID AFTER polling: " + SparkEnv.get.executorId)
val json_schema = records.sparkSession.read.json(records.select("value").as[String]).schema
var df: DataFrame = records.select(from_json(col("value"), json_schema).as("json"))
df = df.select(col("json.*"))

Spark Streaming: Text data source supports only a single column

I am consuming Kafka data and then stream the data to HDFS.
The data stored in Kafka topic trial is like:
hadoop
hive
hive
kafka
hive
However, when I submit my codes, it returns:
Exception in thread "main"
org.apache.spark.sql.streaming.StreamingQueryException: Text data source supports only a single column, and you have 7 columns.;
=== Streaming Query ===
Identifier: [id = 2f3c7433-f511-49e6-bdcf-4275b1f1229a, runId = 9c0f7a35-118a-469c-990f-af00f55d95fb]
Current Committed Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":13}}}
Current Available Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":14}}}
My question is: as shown above, the data stored in Kafka comprises only ONE column, why the program says there are 7 columns ?
Any help is appreciated.
My spark-streaming codes:
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder.master("local[4]")
.appName("SpeedTester")
.config("spark.driver.memory", "3g")
.getOrCreate()
val ds = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "192.168.95.20:9092")
.option("subscribe", "trial")
.option("startingOffsets" , "earliest")
.load()
.writeStream
.format("text")
.option("path", "hdfs://192.168.95.21:8022/tmp/streaming/fixed")
.option("checkpointLocation", "/tmp/checkpoint")
.start()
.awaitTermination()
}
That is explained in the Structured Streaming + Kafka Integration Guide:
Each row in the source has the following schema:
Column Type
key binary
value binary
topic string
partition int
offset long
timestamp long
timestampType int
Which gives exactly seven columns. If you want to write only payload (value) select it and cast to string:
spark.readStream
...
.load()
.selectExpr("CAST(value as string)")
.writeStream
...
.awaitTermination()

How to print Json encoded messages using Spark Structured Streaming

I have a DataSet[Row] where each row is JSON string. I want to just print the JSON stream or count the JSON stream per batch.
Here is my code so far
val ds = sparkSession.readStream()
.format("kafka")
.option("kafka.bootstrap.servers",bootstrapServers"))
.option("subscribe", topicName)
.option("checkpointLocation", hdfsCheckPointDir)
.load();
val ds1 = ds.select(from_json(col("value").cast("string"), schema) as 'payload)
val ds2 = ds1.select($"payload.info")
val query = ds2.writeStream.outputMode("append").queryName("table").format("memory").start()
query.awaitTermination()
select * from table; -- don't see anything and there are no errors. However when I run my Kafka consumer separately (independent ofSpark I can see the data)
My question really is what do I need to do just print the data I am receiving from Kafka using Structured Streaming? The messages in Kafka are JSON encoded strings so I am converting JSON encoded strings to some struct and eventually to a dataset. I am using Spark 2.1.0
val ds1 = ds.select(from_json(col("value").cast("string"), schema) as payload).select($"payload.*")
That will print your data on the console.
ds1.writeStream.format("console").option("truncate", "false").start().awaitTermination()
Always use something like awaitTermination() or thread.Sleep(time in seconds) in these type of situations.

Resources