Sending json events to kafka in non-stringified format - apache-spark

I have created a DataFrame like the one below, using to_json() to build the JSON array value.
+-----------------------------------------------------------------------------------------------------------+
|json_data |
+-----------------------------------------------------------------------------------------------------------+
|{"name":"sensor1","value-array":[{"time":"2020-11-27T01:01:00.000Z","sensorvalue":11.0,"tag1":"tagvalue"}]}|
+-----------------------------------------------------------------------------------------------------------+
I am using the code below to send the DataFrame to a Kafka topic.
However, when I consume the data from the topic, the nested JSON has been stringified.
Code to push the data to Kafka:
outgoingDF.selectExpr("CAST(Key as STRING) as key", "to_json(struct(*)) AS value")
.write
.format("kafka")
.option("topic", "topic_test")
.option("kafka.bootstrap.servers", "localhost:9093")
.option("checkpointLocation", checkpointPath)
.option("kafka.sasl.mechanism", "PLAIN")
.option("kafka.security.protocol", "SASL_SSL")
.option("truncate", false)
.save()
Stringified data being received in kafka:
{
"name": "sensor1",
"value-array": "[{\"time\":\"2020-11-27T01:01:00.000Z\",\"sensorvalue\":11.0,\"tag1\":\"tagvalue\"}]"
}
How can we send the data to the Kafka topic so that the output does not contain stringified JSON?

json_data is already a string column, and you are passing it to to_json(struct("*")) again, so the JSON text is serialized a second time and its quotes are escaped.
Check the value column that is being sent to Kafka:
df.withColumn("value",to_json(struct($"*"))).show(false)
+-----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
|json_data |value |
+-----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
|{"name":"sensor1","value-array":[{"time":"2020-11-27T01:01:00.000Z","sensorvalue":11.0,"tag1":"tagvalue"}]}|{"json_data":"{\"name\":\"sensor1\",\"value-array\":[{\"time\":\"2020-11-27T01:01:00.000Z\",\"sensorvalue\":11.0,\"tag1\":\"tagvalue\"}]}"}|
+-----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
Try the code below, which builds value from the underlying columns instead of the already serialized string.
df
.withColumn("value-array",array(struct($"time",$"sensorvalue",$"tag1")))
.select($"Key".cast("string").as("key"), to_json(struct($"name",$"value-array")).as("value"))
.write
.format("kafka")
.option("topic", "topic_test")
.option("kafka.bootstrap.servers", "localhost:9093")
.option("checkpointLocation", checkpointPath)
.option("kafka.sasl.mechanism", "PLAIN")
.option("kafka.security.protocol", "SASL_SSL")
.option("truncate", false)
.save()
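The double encoding described in this answer can be reproduced without Spark. The following is a minimal sketch using only Python's standard json module; the variable names are illustrative and not taken from the original post. It shows that serializing a record whose field already holds JSON text escapes that text, whereas parsing the field first keeps it nested. In Spark, the parsing step would presumably be done with from_json and an explicit schema before calling to_json.

```python
import json

# JSON text that has already been serialized once (like the json_data column).
inner = json.dumps({
    "name": "sensor1",
    "value-array": [
        {"time": "2020-11-27T01:01:00.000Z", "sensorvalue": 11.0, "tag1": "tagvalue"}
    ],
})

# Serializing a record that holds the JSON text as a plain string escapes it again.
double_encoded = json.dumps({"json_data": inner})

# Parsing the text back into a structure first keeps it nested in the output.
nested = json.dumps({"json_data": json.loads(inner)})

print(double_encoded)
print(nested)
```

A consumer that decodes double_encoded gets a string for json_data and has to decode it a second time; with nested it gets an object directly.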

Related

Spark Streaming subscribe multiple topics and write into multiple topics

I have some Kafka topics named as follows:
'ingestion_src_api_iq_BTCUSD_1_json', 'ingestion_src_api_iq_BTCUSD_5_json', 'ingestion_src_api_iq_BTCUSD_60_json'
I'm reading all of these topics, which share the same data structure, using the "subscribePattern" option in Spark.
(spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", bootstrap_server)
.option("subscribePattern", "ingestion_src_api.*")
.option("startingOffsets", "latest")
.load()
.select(col("topic").cast("string"), from_json(col("value").cast("string"), schema).alias("value"))
.select(to_json(struct(expr("value.active_id as active_id"), expr("value.size as timeframe"),
expr("cast(value.at / 1000000000 as timestamp) as executed_at"), expr("FROM_UNIXTIME(value.from) as candle_from"),
expr("FROM_UNIXTIME(value.to) as candle_to"), expr("value.id as period"),
"value.open", "value.close", "value.min", "value.max", "value.ask", "value.bid", "value.volume")).alias("value"))
.writeStream.format("kafka").option("kafka.bootstrap.servers", bootstrap_server)
.option("topic", "processed_src_api_iq_data")
.option("checkpointLocation", f"./checkpoint/")
.start()
)
How can I write the transformed data to different topics, such as:
'processed_src_api_iq_BTCUSD_1_json', 'processed_src_api_iq_BTCUSD_5_json', 'processed_src_api_iq_BTCUSD_60_json'
With my current code I can only write to a single topic, "processed_src_api_iq_data".
The DataFrame passed to format("kafka") can include a string column named topic, which determines the topic that each row's value (and key, if present) is produced to. Set it instead of using option("topic", ...), as documented at
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka
The documentation states that the topic column is required if the "topic" configuration option is not specified.
Use withColumn to add the necessary values, based on the other columns that you have. Note that the final select in your query keeps only value, so the topic column has to be carried through it as well.
Alternatively, create multiple DataFrames and call writeStream.format("kafka") with an individual option("topic", ...) setting on each:
# Parentheses let the method chain span several lines in Python
raw_df = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_server)
    .option("subscribePattern", "ingestion_src_api.*")
    .option("startingOffsets", "latest")
    .load())
parsed_df = raw_df.select(col("topic").cast("string"), from_json(col("value").cast("string"), schema).alias("value"))
processed_df = parsed_df.select(to_json(struct(
    expr("value.active_id as active_id"),
    expr("value.size as timeframe"),
    expr("cast(value.at / 1000000000 as timestamp) as executed_at"),
    expr("FROM_UNIXTIME(value.from) as candle_from"),
    expr("FROM_UNIXTIME(value.to) as candle_to"),
    expr("value.id as period"),
    "value.open", "value.close", "value.min", "value.max", "value.ask", "value.bid", "value.volume"
)).alias("value"))
btc_1 = processed_df.filter( ... something to get just this data )
btc_5 = processed_df.filter( ... etc )
btc_1.writeStream.format("kafka")
.option("topic", "processed_src_api_iq_BTCUSD_1_json")
...
btc_5.writeStream.format("kafka")
.option("topic", "processed_src_api_iq_BTCUSD_5_json")
...
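For the topic-column approach, the output topic name has to be derived from the input topic name. The sketch below shows that mapping in plain Python, assuming the only difference between the two naming schemes is the leading "ingestion_" versus "processed_" prefix (an assumption based on the topic names in the question). The helper name output_topic is hypothetical. In Spark the same rewrite could presumably be expressed with regexp_replace on the topic column.

```python
import re

def output_topic(source_topic: str) -> str:
    """Map an input topic name to its output topic by swapping the prefix.

    Assumes input topics start with "ingestion_"; names without that
    prefix are returned unchanged.
    """
    return re.sub(r"^ingestion_", "processed_", source_topic)

source_topics = [
    "ingestion_src_api_iq_BTCUSD_1_json",
    "ingestion_src_api_iq_BTCUSD_5_json",
    "ingestion_src_api_iq_BTCUSD_60_json",
]
for name in source_topics:
    print(name, "->", output_topic(name))
```

Deriving the name from the source topic avoids maintaining one filter and one writer per timeframe.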

How to guarantee sequence of execution of multiple sinks in spark structured streaming

In my scenario, I have a Structured Streaming application which reads from Kafka and writes to HDFS and Kafka using 3 different sinks. The HDFS sink is the primary one and the other two are secondary. I want the primary sink to run first and then the secondary sinks. All of them have a trigger time of 60 seconds. Is there a way to achieve that in Spark Structured Streaming? Here is the code snippet:
val spark = SparkSession
.builder
.master(StreamerConfig.sparkMaster)
.appName(StreamerConfig.sparkAppName)
.getOrCreate()
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.streaming.stopGracefullyOnShutdown","true")
spark.conf.set("spark.sql.files.ignoreCorruptFiles","true")
spark.conf.set("spark.dynamicAllocation.enabled","true")
spark.conf.set("spark.shuffle.service.enabled","true")
val readData = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers",StreamerConfig.kafkaBootstrapServer)
.option("subscribe",StreamerConfig.topicName)
.option("failOnDataLoss", false)
.option("startingOffsets",StreamerConfig.kafkaStartingOffset)
.option("maxOffsetsPerTrigger",StreamerConfig.maxOffsetsPerTrigger)
.load()
val deserializedRecords = StreamerUtils.deserializeAndMapData(readData,spark)
val streamingQuery = deserializedRecords.writeStream
.queryName(s"Persist data to hive table for ${StreamerConfig.topicName}")
.outputMode("append")
.format("orc")
.option("path",StreamerConfig.hdfsLandingPath)
.option("checkpointLocation",StreamerConfig.checkpointLocation)
.partitionBy("date","hour")
.option("truncate","false")
.trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
.start()
deserializedRecords.select(to_json(struct("*")).alias("value"))
.writeStream
.format("kafka") // Local Testing - "console"
.option("topic", StreamerConfig.watermarkKafkaTopic)
.option("kafka.bootstrap.servers", StreamerConfig.kafkaBroker)
.option("checkpointLocation", StreamerConfig.phase1Checkpoints)
.trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
.start()
deserializedRecords.select(to_json(struct("*")).alias("value"))
.writeStream
.format("kafka") // Local Testing - "console"
.option("topic", StreamerConfig.watermarkKafkaTopic)
.option("kafka.bootstrap.servers", StreamerConfig.kafkaBroker)
.option("checkpointLocation", StreamerConfig.phase2Checkpoints)
.trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
.start()
PS: I am using spark 2.3.2

Send data to Kafka topics based on a condition in Dataframe

I want to choose the destination Kafka topic based on the value of the data in Spark Structured Streaming.
Is it possible to do this?
When I tried the following code, only the first query ran; the second one below it was never executed.
(testdf
.filter(f.col("value") == "A")
.selectExpr("CAST(value as STRING) as value")
.writeStream
.format("kafka")
.option("checkpointLocation", "/checkpoint_1")
.option("kafka.bootstrap.servers","~~:9092")
.option("topic", "test")
.option("startingOffsets", "latest")
.start()
)
(testdf
.filter(f.col("value") == "B")
.selectExpr("CAST(value as STRING) as value")
.writeStream
.format("kafka")
.option("checkpointLocation", "/checkpoint_2")
.option("kafka.bootstrap.servers","~~:9092")
.option("topic", "testB")
.option("startingOffsets", "latest")
.start()
)
With this code, data is only written to the topic named test.
Can anyone think of a way to do this?
I want to route a DataFrame like the following to different destinations:
|type|value|
| A |testvalue|
| B |testvalue|
Rows with type A should go to topic test, and rows with type B to topic testB.
With recent versions of Spark, you can add a topic column to your DataFrame; it directs each record to the corresponding topic.
In your case, that means you can do something like:
(testdf
    .withColumn("topic", f.when(f.col("value") == "A", f.lit("test")).otherwise(f.lit("testB")))
    .selectExpr("CAST(value as STRING) as value", "topic")
    .writeStream
    .format("kafka")
    .option("checkpointLocation", "/checkpoint_1")
    .option("kafka.bootstrap.servers","~~:9092")
    .start()
)
Thanks, Mike.
I was able to achieve this by running the following code:
(
testdf
.withColumn("topic",f.when(f.col("testTime") == "A", f.lit("test")).otherwise(("testB")))
.selectExpr("CAST(value as STRING) as value", "topic")
.writeStream
.format("kafka")
.option("checkpointLocation", "/checkpoint_2")
.option("startingOffsets", "latest")
.option("kafka.bootstrap.servers","9092")
.start()
)
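Both versions above omit option("topic", ...). According to the Kafka integration guide, a configured topic option overrides any topic column in the data, so leaving the option in place would send every row to a single topic. The sketch below models that resolution rule in plain Python; resolve_topic and the sample rows are hypothetical illustrations, not Spark APIs.

```python
from typing import Optional

def resolve_topic(row: dict, topic_option: Optional[str] = None) -> str:
    """Pick the destination topic for one row, mirroring the documented rule."""
    if topic_option is not None:
        # A configured "topic" option wins over any per-row topic column.
        return topic_option
    topic = row.get("topic")
    if topic is None:
        # Without the option, every row must carry its own topic.
        raise ValueError("no 'topic' option set and row has no topic column")
    return topic

rows = [
    {"value": "A", "topic": "test"},
    {"value": "B", "topic": "testB"},
]

# No option configured: each row goes to the topic named in its own column.
per_row = [resolve_topic(r) for r in rows]

# Option configured: the topic column is ignored for every row.
forced = [resolve_topic(r, topic_option="test") for r in rows]

print(per_row)
print(forced)
```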

Read Kafka topic tail in Spark

I need to subscribe to a Kafka topic at its latest offset, read some of the newest records, print them, and finish. How can I do this in Spark? I suppose I could do something like this:
sqlContext
.read
.format("kafka")
.option("kafka.bootstrap.servers", "192.168.1.1:9092,...")
.option("subscribe", "myTopic")
.option("startingOffsets", "latest")
.filter($"someField" === "someValue")
.take(10)
.show
You need to know in advance which offsets in which partitions you want to consume from Kafka. If you have that information, you can do something like:
// Subscribe to one topic, specifying explicit Kafka offsets per partition
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "192.168.1.1:9092,...")
.option("subscribe", "myTopic")
.option("startingOffsets", """{"myTopic":{"0":20,"1":20}}""")
.option("endingOffsets", """{"myTopic":{"0":25,"1":25}}""")
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
.filter(...)
More details on startingOffsets and endingOffsets are given in the Kafka + Spark Integration Guide.
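The offset strings in the answer are plain JSON, so they can be generated rather than typed by hand. The following sketch assumes the current end offset of each partition is already known (how to obtain it is outside this example) and builds the startingOffsets and endingOffsets values that cover the last n records per partition. The helper name tail_offsets is hypothetical, and clamping at 0 is a simplification: on a real topic the earliest retained offset may be higher because of retention.

```python
import json

def tail_offsets(topic: str, end_offsets: dict, n: int) -> tuple:
    """Build startingOffsets/endingOffsets JSON for the last n records per partition.

    end_offsets maps partition number to that partition's end offset.
    """
    # Kafka partition ids become JSON object keys, so they must be strings.
    starting = {str(p): max(end - n, 0) for p, end in end_offsets.items()}
    ending = {str(p): end for p, end in end_offsets.items()}
    return json.dumps({topic: starting}), json.dumps({topic: ending})

# Partitions 0 and 1 both end at offset 25; read the last 5 records of each.
start_json, end_json = tail_offsets("myTopic", {0: 25, 1: 25}, 5)
print(start_json)
print(end_json)
```

The two strings can then be passed to option("startingOffsets", ...) and option("endingOffsets", ...) in the batch read shown above.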

Pyspark Kafka structured streaming: error while writing out

I am able to read a stream from a Kafka topic and write the (transformed) data back to another Kafka topic in two different steps in PySpark. The code to do that is as follows:
# Define Stream:
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "instream") \
.load()
# Transform
matchdata = df.select(from_json(F.col("value").cast("string"),schema).alias("value"))\
.select(F.col('value').cast("string"))
# Stream the data, from a Kafka topic to a Spark in-memory table
query = matchdata \
.writeStream \
.format("memory") \
.queryName("PositionTable") \
.outputMode("append") \
.start()
query.awaitTermination(5)
# Create a new dataframe after stream completes:
tmp_df=spark.sql("select * from PositionTable")
# Write data to a different Kafka topic
tmp_df \
.write \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "outstream") \
.save()
The code above works as expected: the data in Kafka topic "instream" is read in PySpark, and then PySpark can write out data to Kafka topic "outstream".
However, I would like to read the stream in and write the transformed data back out immediately (the stream will be unbounded and we would like insights immediately as the data rolls in). Following the documentation, I replaced the query above with the following:
query = matchdata \
.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "outstream") \
.option("checkpointLocation", "/path/to/HDFS/dir") \
.start()
This does not appear to work.
There is no error message, so I do not know what is wrong. I've also tried windowing and aggregating within windows, but that also does not work. Any advice will be appreciated!
Ok, I found the problem. The main reason was that the checkpoint directory "/path/to/HDFS/dir" has to exist. After creating that directory the code ran as expected. It would have been nice if an error message stated something along those lines.
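Given that finding, one defensive step is to create the checkpoint directory before starting the query. The sketch below does this for a local filesystem path only, which is an assumption made for illustration: the path in the question is on HDFS, where the directory would have to be created with the cluster's own tooling (for example hdfs dfs -mkdir -p). The helper name ensure_checkpoint_dir is hypothetical.

```python
import os
import tempfile

def ensure_checkpoint_dir(path: str) -> str:
    """Create the checkpoint directory (and any parents) if it is missing."""
    # exist_ok=True makes the call safe to repeat on every start-up.
    os.makedirs(path, exist_ok=True)
    return path

# Demonstration on a throwaway local directory (not HDFS).
base = tempfile.mkdtemp()
checkpoint_path = ensure_checkpoint_dir(os.path.join(base, "checkpoints", "outstream"))
print(checkpoint_path, os.path.isdir(checkpoint_path))
```

The returned path would then be the value passed to option("checkpointLocation", ...).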
