How to consolidate Spark streaming output into an array before writing to Kafka - apache-spark

Currently, I have the following df
+-------+--------------------+-----+
|    key|          created_at|count|
+-------+--------------------+-----+
|Bullish|[2017-08-06 08:00...|   12|
|Bearish|[2017-08-06 08:00...|    1|
+-------+--------------------+-----+
I use the following to stream the data to Kafka
df.selectExpr("CAST(key AS STRING) AS key", "to_json(struct(*)) AS value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092").option("topic","chart3").option("checkpointLocation", "/tmp/checkpoints2")
.outputMode("complete")
.start()
The problem here is that each row of the DataFrame is written to Kafka as a separate message, so my consumer will get the messages one by one.
Is there any way to consolidate all rows into an array and stream that to Kafka, so that my consumer can get the whole data in one go?
Thanks for any advice.

My consumer will get the message one by one.
Not exactly. It depends on the Kafka producer properties. You can specify your own properties and use, for example:
props.put("batch.size", 16384);
In the background Spark uses a normal cached KafkaProducer. It will use the producer properties that you provide as options when submitting the query.
See also the KafkaProducer Javadoc. Be aware that it may not scale correctly.
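In Structured Streaming, producer properties are passed through the writer options with a kafka. prefix. A minimal sketch against the DataFrame and topic from the question (the batch.size and linger.ms values are illustrative, and each row still becomes its own Kafka message):

// Sketch: pass KafkaProducer settings via "kafka."-prefixed options.
// Larger batch.size / linger.ms let the producer group more records per
// request, but each DataFrame row is still written as a separate message.
df.selectExpr("CAST(key AS STRING) AS key", "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("kafka.batch.size", "16384")   // producer batch size in bytes
  .option("kafka.linger.ms", "50")       // wait up to 50 ms to fill a batch
  .option("topic", "chart3")
  .option("checkpointLocation", "/tmp/checkpoints2")
  .outputMode("complete")
  .start()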

Related

When writing to Eventhub from PySpark, can topics be dynamic?

Here is my data:
+---+-----------+-----+
|key|animal_type|value|
+---+-----------+-----+
|123|        cat|meows|
|456|        dog|barks|
+---+-----------+-----+
I am currently writing to Eventhub from databricks like so:
(df.select("key","value").writeStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrap_server)
.option("topic","cats").option("kafka.security.protocol", "SASL_SSL")
.option("kafka.sasl.mechanism", "PLAIN")
.option("kafka.sasl.jaas.config", connection_string)
.option("kafka.request.timeout.ms", "3600000")
.option("checkpointLocation", checkpoint_path)
.start() )
My challenge is, since the dataframe contains records about both cats and dogs, I don't want to hardcode the topic as 'cats' since sometimes it will be 'dogs'. Instead, I would like a way for EventHub to dynamically assign the topic based on the value in the column animal_type.
Is this possible? Or do I need to have a single dataframe/write-config per topic?
You need to alias the animal_type column to topic prior to writing, then remove the topic option.
Refer to the Spark Structured Streaming + Kafka integration documentation for other ways to structure the dataframe when writing to the Kafka API.
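A minimal sketch of that approach (shown in Scala to match the rest of this page; bootstrap_server, connection_string, and checkpoint_path carry over from the question, and the PySpark version is the same select with col("animal_type").alias("topic")):

import org.apache.spark.sql.functions.col

// Sketch: rows carry their destination in a "topic" column, so the static
// .option("topic", "cats") is removed and each animal_type routes itself.
df.select(
    col("key"),
    col("value"),
    col("animal_type").alias("topic"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrap_server)
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.sasl.jaas.config", connection_string)
  .option("checkpointLocation", checkpoint_path)
  .start()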

Writing data as JSON array with Spark Structured Streaming

I have to write data from Spark Structured Streaming as a JSON array. I have tried the code below:
df.selectExpr("to_json(struct(*)) AS value").toJSON
which returns a Dataset[String], but I am unable to write it as a JSON array.
Current Output:
{"name":"test","id":"id"}
{"name":"test1","id":"id1"}
Expected Output:
[{"name":"test","id":"id"},{"name":"test1","id":"id1"}]
Edit (moving comments into question):
After using the proposed collect_list method I am getting:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;
Then I tried something like this:
.withColumn("timestamp", unix_timestamp(col("event_epoch"), "MM/dd/yyyy hh:mm:ss aa"))
.withWatermark("event_epoch", "1 minutes")
.groupBy(col("event_epoch"))
.agg(max(col("event_epoch")).alias("timestamp"))
But I don't want to add a new column.
You can use the SQL built-in function collect_list for this. This function collects and returns a list of non-unique elements (compared to collect_set, which returns only unique elements).
From the source code for collect_list you will see that this is an aggregation function. Based on the requirements given in the Structured Streaming Programming Guide on output modes, only the output modes "complete" and "update" are supported for aggregations without a watermark.
As I understand from your comments, you do not wish to add watermark and new columns. Also, the error you are facing
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;
reminds you to not use the output mode "append".
In the comments, you have mentioned that you plan to produce the results into a Kafka message. One big JSON Array as one Kafka value. The complete code could look like
val df = spark.readStream
.[...] // in my test I am reading from Kafka source
.load()
.selectExpr("CAST(key AS STRING) as key", "CAST(value AS STRING) as value", "offset", "partition")
// do not forget to convert your data into a String before writing to Kafka
.selectExpr("CAST(collect_list(to_json(struct(*))) AS STRING) AS value")
df.writeStream
.format("kafka")
.outputMode("complete")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "test")
.option("checkpointLocation", "/path/to/sparkCheckpoint")
.trigger(Trigger.ProcessingTime(10000))
.start()
.awaitTermination()
Given the key/value pairs (k1,v1), (k2,v2), and (k3,v3) as inputs you will get a value in the Kafka topic that contains all selected data as a JSON Array:
[{"key":"k1","value":"v1","offset":7,"partition":0}, {"key":"k2","value":"v2","offset":8,"partition":0}, {"key":"k3","value":"v3","offset":9,"partition":0}]
Tested with Spark 3.0.1 and Kafka 2.5.0.

How to validate every row of a streaming batch?

I need to validate each row of a streaming DataFrame (consumed through readStream from Kafka), but I am getting the error
Queries with streaming sources must be executed with writeStream.start()
because it does not let me validate the rows one by one.
I had created a Spark batch job that consumes data from Kafka, validates each row against HBase data, runs another set of validations based on the row key, and creates a DataFrame out of it. But there I need to handle the Kafka offsets manually in the code.
To avoid the offset handling, I am trying to use Spark Structured Streaming, but there I am not able to validate row by row.
writeStream.foreach(ForeachWriter) is fine for sinking to any external data source or writing to Kafka.
But in my case, I need to validate each row and create a new DataFrame based on my validation. The ForeachWriter's process method does not let me collect the data into an external class or list.
Errors:
I get this error when I try to access the streaming DataFrame to validate it:
Queries with streaming sources must be executed with writeStream.start();
I get "Task is not serializable" when I try to create a list out of foreach (an object extending ForeachWriter). I will update with more details as I try other options.
I am trying to achieve a batch-like run using writeStream.trigger(Trigger.Once) with a checkpointLocation.
Updating with the Spark batch and Structured Streaming code.
val rawData = spark.read
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBootStrap)
.option("subscribePattern", kafkaSubTopic)
.option("startingOffsets", "earliest")
//.option("endingOffsets", "latest")
.load()

rawData.collect.foreach(row => {
  if (dValidate.dValidate(row)) {
    validatedCandidates += (row.getString(0))
  }
})
In the above code I need to handle the Kafka offsets manually for reruns, so I decided to use Structured Streaming instead.
val rawData = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBootStrap)
.option("subscribe", kafkaSubTopic)
.option("enable.auto.commit", "true")
.option("startingOffsets","latest")
.option("minPartitions", "10")
.option("failOnDataLoss", "true")
.load()
val sInput = new SinkInput(validatedCandidates,dValidate)
rawData.writeStream
.foreach(sInput)
.outputMode(OutputMode.Append())
.option("truncate", "false")
.trigger(Trigger.Once())
.start()
am getting "Task not serialized" error in here.
with class SinkInput , I am trying to do the same collect operation with external dValidate instance
Unless I misunderstood your case, rawData is a streaming query (a streaming Dataset) and does not support collect. The following part of your code is not correct:
rawData.collect
That's not supported and hence the exception.
You should be using foreach or foreachBatch to access rows.
Do this instead:
rawData.writeStream.foreach(...)
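A minimal sketch of the foreachBatch route, assuming the rawData, dValidate, and validatedCandidates from the question (the checkpoint path is illustrative):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// Sketch: foreachBatch hands each micro-batch over as a plain (non-streaming)
// DataFrame, so collect and row-by-row validation are allowed again, while
// Spark keeps tracking the Kafka offsets in the checkpoint.
rawData.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Runs on the driver; collecting first avoids shipping dValidate to executors,
    // mirroring the batch job in the question.
    batchDf.collect().foreach { row =>
      if (dValidate.dValidate(row)) {
        validatedCandidates += row.getString(0)
      }
    }
  }
  .option("checkpointLocation", "/path/to/checkpoint") // illustrative path
  .trigger(Trigger.Once())
  .start()
  .awaitTermination()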

How to preserve event order per key in Structured Streaming Repartitioning By Key?

I want to write a Spark Structured Streaming Kafka consumer that reads data from a one-partition Kafka topic, repartitions the incoming data by "key" into 3 Spark partitions while keeping the messages ordered per key, and writes them to another Kafka topic with 3 partitions.
I used Dataframe.repartition(3, $"key"), which I believe uses HashPartitioner. Code is provided below.
When I executed the query with a fixed-interval micro-batch trigger, I visually verified that the output messages were in the expected order, but my assumption is that order is not guaranteed on the resulting partitions. I am looking for some affirmation or veto of my assumption, ideally with pointers to the Spark code repo or documentation.
I also tried using Dataframe.sortWithinPartitions, however this does not seem to be supported on a streaming DataFrame without aggregation.
One option I tried was to convert the DataFrame to an RDD and apply repartitionAndSortWithinPartitions, which repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by their keys. However, then I cannot use this RDD in the writeStream operation to write the result to the output Kafka topic.
Is there a data frame repartitioning API that helps sort the repartitioned data in the streaming context?
Are there any other alternatives?
Does the default trigger type or fixed-interval trigger type for micro-batch execution provide any sort of message ordering guarantees?
Incoming data:
case class KVOutput(key: String, ts: Long, value: String, spark_partition: Int)
val df = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", kafkaBrokers.get)
.option("subscribe", Array(kafkaInputTopic.get).mkString(","))
.option("maxOffsetsPerTrigger",30)
.load()
val inputDf = df.selectExpr("CAST(key AS STRING)","CAST(value AS STRING)")
val resDf = inputDf.repartition(3, $"key")
.select(from_json($"value", schema).as("kv"))
.selectExpr("kv.key", "kv.ts", "kv.value")
.withColumn("spark_partition", spark_partition_id())
.select($"key", $"ts", $"value", $"spark_partition").as[KVOutput]
.sortWithinPartitions($"ts", $"value")
.select($"key".cast(StringType).as("key"), to_json(struct($"*")).cast(StringType).as("value"))
val query = resDf.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBrokers.get)
.option("topic", kafkaOutputTopic.get)
.option("checkpointLocation", checkpointLocation.get)
.start()
When I submit this application, it fails with
8/11/08 22:13:20 ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Sorting is not supported on streaming DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete output mode;;

How to check a Kafka topic and stop streaming after a limited time duration or record count?

My ultimate goal is to see if a Kafka topic is running and if the data in it is good, and otherwise fail / throw an error.
If I could pull just 100 messages, or pull for just 60 seconds, I think I could accomplish what I wanted. But all the streaming examples / questions I have found online have no intention of ever shutting down the streaming connection.
Here is the best working code I have so far. It pulls data and displays it, but it keeps trying to pull more data, and if I try to access it in the next line, it hasn't had a chance to pull the data yet. I assume I need some sort of callback. Has anyone done something similar? Is this the best way of going about this?
I am using Databricks notebooks to run my code.
import org.apache.spark.sql.functions.{explode, split}
val kafka = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "<kafka server>:9092")
.option("subscribe", "<topic>")
.option("startingOffsets", "earliest")
.load()
val df = kafka.select(explode(split($"value".cast("string"), "\\s+")).as("word"))
display(df.select($"word"))
The trick is that you don't need streaming at all. The Kafka source supports batch queries: replace readStream with read and adjust startingOffsets and endingOffsets.
val df = spark
.read
.format("kafka")
... // Remaining options
.load()
You can find examples in the Structured Streaming + Kafka integration documentation.
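A minimal sketch of that batch variant, reusing the broker and topic placeholders from the question (the 100-record cap and the emptiness checks are illustrative ways to "see if the data is good"):

import org.apache.spark.sql.functions.{col, length}

// Sketch: a bounded batch read from Kafka; the job finishes on its own
// instead of running as an open-ended stream.
val sample = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "<kafka server>:9092")
  .option("subscribe", "<topic>")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")       // bounded: only what exists now
  .load()
  .selectExpr("CAST(value AS STRING) AS value")
  .limit(100)                              // inspect at most 100 messages

// Fail fast if the topic is empty or the payloads look wrong.
if (sample.isEmpty) throw new IllegalStateException("topic has no data")
if (sample.filter(length(col("value")) === 0).count() > 0)
  throw new IllegalStateException("topic contains empty messages")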
For streaming queries you can use once trigger, although it might not be the best choice in this case:
df.writeStream
.trigger(Trigger.Once)
... // Handle the output, for example with foreach sink (?)
You could also use the standard Kafka client to fetch some data without starting a SparkSession.
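And a minimal sketch of that last option with the plain Kafka consumer API (broker, topic, group id, and the 60-second / 100-record limits are illustrative assumptions):

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

// Sketch: peek at the topic with the plain Kafka client, no Spark involved.
val props = new Properties()
props.put("bootstrap.servers", "<kafka server>:9092")
props.put("group.id", "topic-health-check")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("auto.offset.reset", "earliest")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("<topic>"))

// Poll for up to 60 seconds or until 100 records have been seen.
val deadline = System.currentTimeMillis() + 60000L
var seen = 0
while (System.currentTimeMillis() < deadline && seen < 100) {
  val records = consumer.poll(Duration.ofSeconds(5))
  seen += records.count()
  // ... validate individual record values here if needed
}
consumer.close()

if (seen == 0) throw new IllegalStateException("no data pulled from topic")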
