how to determine where the Kafka consumption is made on Spark - apache-spark

We tried two MlLib transformers for Kafka consuming: one using Structured Batch Query Stream like here, and one using normal Kafka consumers. The normal consumers read into a list that gets converted to a dataframe
We created two MlLib pipelines starting with an empty dataframe. The first transformer of these pipes did the reading from Kafka. We had a pipe for each kafka transformer type.
then ran the pipes: 1st with normal kafka consumers, 2nd with Spark consumer, 3rd with normal consumers again.
The spark config had 4 executors, with 1 core each:
spark.executor.cores": "1", "spark.executor.instances": 4,
Questions are:
A. where is the consumption made? On the executors or on the driver? according to driver UI, it looks like in both cases the executors did all the work - the driver did not pass any data and 4 executors got created.
B. Why do we have a different number of executors running? In the 1st run, with normal consumers, we see 4 executors working; In the 2nd run, with spark Kafka connector, 1 executor; In 3rd run with normal consumers, 1 executor but 2 cores?
you`ll see the driver's UI attached at the bottom.
this is the relevant code:
Normal Consumer:
var kafkaConsumer: KafkaConsumer[String, String] = null
val readMessages = () => {
for (record <- records) {
recordList.append(record.value())
}
}
kafkaConsumer.subscribe(util.Arrays.asList($(topic)))
readMessages()
var df = recordList.toDF
kafkaConsumer.close()
val json_schema =
df.sparkSession.read.json(df.select("value").as[String]).schema
df = df.select(from_json(col("value"), json_schema).as("json"))
df = df.select(col("json.*"))
Spark consumer:
val records = dataset
.sparkSession
.read
.format("kafka")
.option("kafka.bootstrap.servers", $(url))
.option("subscribe", $(this.topic))
.option("kafkaConsumer.pollTimeoutMs", s"${$(timeoutMs)}")
.option("startingOffsets", $(startingOffsets))
.option("endingOffsets", $(endingOffsets))
.load
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
OmnixLogger.warn(uid, "Executor ID AFTER polling: " + SparkEnv.get.executorId)
val json_schema = records.sparkSession.read.json(records.select("value").as[String]).schema
var df: DataFrame = records.select(from_json(col("value"), json_schema).as("json"))
df = df.select(col("json.*"))

Related

How to calculate moving average in spark structured streaming?

I am trying to calculate a moving average in a spark structured streaming in terms of rows preceding and not time-event based.
Kafka has string messages like this:
device1#227.92#2021-08-19T12:15:13.540Z
and there is this code
Dataset<Row> lines = sparkSession.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "users")
.load()
.selectExpr("CAST(value AS STRING)")
.map((MapFunction<Row, Row>) row -> {
String message = row.getAs("value");
String[] newRow = message.split("#");
return RowFactory.create(newRow);
}, RowEncoder.apply(structType))
.selectExpr("CAST(item AS STRING)", "CAST(value AS DOUBLE)", "CAST(timestamp AS TIMESTAMP)");
The above code reads stream from kafka and transforms string messages to rows.
When i try to do sth like this:
WindowSpec threeRowWindow = Window.partitionBy("item").orderBy("timestamp").rowsBetween(Window.currentRow(), -3);
Dataset<Row> testWindow =
lines.withColumn("avg", functions.avg("value").over(threeRowWindow));
I get this error:
org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;
Is there any other way to calculate the moving average as every message is coming and updating it as new data comes from stream? Or any non time-based operation is by default not supported to spark structured streaming?
Thanks

Read whole Kafka topic as spark dataframe in offsets batches

I am trying to read all data in a kafka topic in batches (reading between two offset values) and load them to spark dataframes, without using readStream in spark streaming.
My idea is:
I first get the total number of data lines in the topic finding the maximum offset value.
I define step, namely the total number of data per batch.
With a for loop I read the data batch from the kafka topic setting startingOffsets and endingOffsets parameters.
This is my code (for a topic with a single partition) to print the count in each batch:
val maxOffsetValue = {
Process(s"kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic topicname")
.!!
.split(":")
.last
.trim
.toInt
}
val step = 1000
for (i <- 0 until maxOffsetValue by step) {
val df: DataFrame = {
spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topicname")
.option("startingOffsets", s"""{"topicname":{"0":${i}}}""")
.option("endingOffsets", s"""{"topicname":{"0":${i+step}}}""")
.load()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
.select(from_json(col("value"), dataSchema) as "data")
.select("data.*")
}
println(s"i: ${i}, i+step: ${i+step}, count: ${df.count()}")
}
However, it seems that the json format for startingOffsets and endingOffsets is not flexible, as apparently all offsets indices need to be specified for each partition, e.g something like {"0":${i}, "1": ${i}}} if there are two partitions.
My questions are:
Is there a better way to achieve the same results, possibly that can be extended directly to a multi partition topic?
Is there a way to read the maximum offset without using a shell command?

Structured Streaming: Reading from multiple Kafka topics at once

I have a Spark Structured Streaming Application which has to read from 12 Kafka topics (Different Schemas, Avro format) at once, deserialize the data and store in HDFS. When I read from a single topic using my code, it works fine and without errors but on running multiple queries together, I'm getting the following error
java.lang.IllegalStateException: Race while writing batch 0
My code is as follows:
def main(args: Array[String]): Unit = {
val kafkaProps = Util.loadProperties(kafkaConfigFile).asScala
val topic_list = ("topic1", "topic2", "topic3", "topic4")
topic_list.foreach(x => {
kafkaProps.update("subscribe", x)
val source= Source.fromInputStream(Util.getInputStream("/schema/topics/" + x)).getLines.mkString
val schemaParser = new Schema.Parser
val schema = schemaParser.parse(source)
val sqlTypeSchema = SchemaConverters.toSqlType(schema).dataType.asInstanceOf[StructType]
val kafkaStreamData = spark
.readStream
.format("kafka")
.options(kafkaProps)
.load()
val udfDeserialize = udf(deserialize(source), DataTypes.createStructType(sqlTypeSchema.fields))
val transformedDeserializedData = kafkaStreamData.select("value").as(Encoders.BINARY)
.withColumn("rows", udfDeserialize(col("value")))
.select("rows.*")
val query = transformedDeserializedData
.writeStream
.trigger(Trigger.ProcessingTime("5 seconds"))
.outputMode("append")
.format("parquet")
.option("path", "/output/topics/" + x)
.option("checkpointLocation", checkpointLocation + "//" + x)
.start()
})
spark.streams.awaitAnyTermination()
}
Alternative. You can use KAFKA Connect (from Confluent), NIFI, StreamSets etc. as your use case seems to fit "dump/persist to HDFS". That said, you need to have these tools (installed). The small files problem you state is not an issue, so be it.
From Apache Kafka 0.9 or later version you can Kafka Connect API for KAFKA --> HDFS Sink (various supported HDFS formats). You need a KAFKA Connect Cluster though, but that is based on your existing Cluster in any event, so not a big deal. But someone needs to maintain.
Some links to get you on your way:
https://data-flair.training/blogs/kafka-connect/
https://github.com/confluentinc/kafka-connect-hdfs

Spark dataframe lose streaming capability after accessing Kafka source

I use Spark 2.4.3 and Kafka 2.3.0. I want to do Spark structured streaming with data coming from Kafka to Spark. In general it does work in the test mode but since I have to do some processing of the data (and do not know another way to do) the Spark data frames do not have the streaming capability anymore.
#!/usr/bin/env python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructField, StructType, StringType, DoubleType
# create schema for data
schema = StructType([StructField("Signal", StringType()),StructField("Value", DoubleType())])
# create spark session
spark = SparkSession.builder.appName("streamer").getOrCreate()
# create DataFrame representing the stream
dsraw = spark.readStream \
.format("kafka").option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "test")
print("dsraw.isStreaming: ", dsraw.isStreaming)
# Convert Kafka stream to something readable
ds = dsraw.selectExpr("CAST(value AS STRING)")
print("ds.isStreaming: ", ds.isStreaming)
# Do query on the converted data
dsQuery = ds.writeStream.queryName("ds_query").format("memory").start()
df1 = spark.sql("select * from ds_query")
print("df1.isStreaming: ", df1.isStreaming)
# convert json into spark dataframe cols
df2 = df1.withColumn("value", from_json("value", schema))
print("df2.isStreaming: ", df2.isStreaming)
The output is:
dsraw.isStreaming: True
ds.isStreaming: True
df1.isStreaming: False
df2.isStreaming: False
So I lose the streaming capability when I create the first dataframe. How can I avoid it? How do I get a streaming Spark dataframe out of a stream?
It is not recommend to use the memory sink for production applications as all the data will be stored in the driver.
There is also no reason to do this, except for debugging purposes, as you can process your streaming dataframes like the 'normal' dataframes. For example:
import pyspark.sql.functions as F
lines = spark.readStream.format("socket").option("host", "XXX.XXX.XXX.XXX").option("port", XXXXX).load()
words = lines.select(lines.value)
words = words.filter(words.value.startswith('h'))
wordCounts = words.groupBy("value").count()
wordCounts = wordCounts.withColumn('count', F.col('count') + 2)
query = wordCounts.writeStream.queryName("test").outputMode("complete").format("memory").start()
In case you still want to go with your approach: Even if df.isStreaming tells you it is not a streaming dataframe (which is correct), the underlying datasource is a stream and the dataframe will therefore grow with each processed batch.

Spark Structured Streaming Kafka Microbatch count

I am using Spark structured streaming to read records from a Kafka topic; I intend to count the number of records received in each 'Micro batch' in Spark readstream
This is a snippet:
val kafka_df = sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribe", "test-count")
.load()
I understand from the docs that kafka_df will be lazily evaluated when a streamingQuery is started (to come next), and as it is evaluated, it holds a micro-batch. So, I figured doing a groupBy on topic followed by a count should work.
Like this:
val counter = kafka_df
.groupBy("topic")
.count()
Now to evaluate all of this, we need a streaminQuery, lets say, a console sink query to print it on the console. And this is where i see the problem. A streamingQuery on aggregate DataFrames, such as kafka_df works only with outputMode complete/update and not on append.
This effectively means that, the count reported by the streamingQuery is cumulative.
Like this:
val counter_json = counter.toJSON //to jsonify
val count_query = counter_json
.writeStream.outputMode("update")
.format("console")
.start() // kicks of lazy evaluation
.awaitTermination()
In a controlled set up, where:
actual Published records: 1500
actual Received micro-batches : 3
aActual Received records: 1500
The count of each microbatch is supposed to be 500, so I hoped (wished) that the query prints to console:
topic: test-count
count: 500
topic: test-count
count: 500
topic: test-count
count: 500
But it doesn't. It actually prints:
topic: test-count
count: 500
topic: test-count
count:1000
topic: test-count
count: 1500
This I understand is because of 'outputMode' complete/update (cumulative)
My question: Is it possible to accurately get the count of each micro-batch is Spark-Kafka structured streaming?
From the docs, I found out about the watermark approach (to support append):
val windowedCounts = kafka_df
.withWatermark("timestamp", "10 seconds")
.groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"topic")
.count()
val console_query = windowedCounts
.writeStream
.outputMode("append")
.format("console")
.start()
.awaitTermination()
But the results of this console_query are inaccurate and appears is way off mark.
TL;DR - Any thoughts on accurately counting the records in Spark-Kafka micro-batch would be appreciated.
If you want to only process a specific number of records with every trigger within a Structured Streaming application using Kafka, use the option maxOffsetsPerTrigger
val kafka_df = sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribe", "test-count")
.option("maxOffsetsPerTrigger", 500)
.load()
"TL;DR - Any thoughts on accurately counting the records in Spark-Kafka micro-batch would be appreciated."
You can count the records fetched from Kafka by using a StreamingQueryListener (ScalaDocs).
This allows you to print out the exact number of rows that were received from the subscribed Kafka topic. The onQueryProgress API gets called during every micro-batch and contains lots of useful meta information on your query. If no data is flowing into the query the onQueryProgress is called every 10 seconds. Below is a simple example that prints out the number of input messages.
spark.streams.addListener(new StreamingQueryListener() {
override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {}
override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {}
override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
println("NumInputRows: " + queryProgress.progress.numInputRows)
}
})
In case you are validating the performance of your Structured Streaming query, it is usually best to keep an eye on the following two metrics:
queryProgress.progress.inputRowsPerSecond
queryProgress.progress.processedRowsPerSecond
In case input is higher than processed you might increase resources for your job or reduce the maximum limit (by reducing the readStream option maxOffsetsPerTrigger). If processed is higher, you may want to increase this limit.

Resources