Spark Streaming subscribe multiple topics and write into multiple topics - apache-spark

I have some kafkas topics with the nomenclatures below:
'ingestion_src_api_iq_BTCUSD_1_json', 'ingestion_src_api_iq_BTCUSD_5_json', 'ingestion_src_api_iq_BTCUSD_60_json'
I'm reading all these topics that has the same data structure using the "subscribePattern" param in spark.
(spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", bootstrap_server)
.option("subscribePattern", "ingestion_src_api.*")
.option("startingOffsets", "latest")
.load()
.select(col("topic").cast("string"), from_json(col("value").cast("string"), schema).alias("value"))
. select(to_json(struct(expr("value.active_id as active_id"), expr("value.size as timeframe"),
expr("cast(value.at / 1000000000 as timestamp) as executed_at"), expr("FROM_UNIXTIME(value.from) as candle_from"),
expr("FROM_UNIXTIME(value.to) as candle_to"), expr("value.id as period"),
"value.open", "value.close", "value.min", "value.max", "value.ask", "value.bid", "value.volume")).alias("value"))
.writeStream.format("kafka").option("kafka.bootstrap.servers", bootstrap_server)
.option("topic", "processed_src_api_iq_data")
.option("checkpointLocation", f"./checkpoint/")
.start()
)
How could I write the transformed data into differents topics like:
'processed_src_api_iq_BTCUSD_1_json', 'processed_src_api_iq_BTCUSD_5_json', 'processed_src_api_iq_BTCUSD_60_json'
In my code I am able to write only in one topic "processed_src_api_iq_data".

The outgoing dataframe to format("kafka") can include a String column named topic which will determine where the value and/or key byte/string columns will be produced to, rather than using option, as documented...
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka
The topic column is required if the “topic” configuration option is not specified
Use withColumn to add the necessary values, based on the other columns that you have.
Alternatively, create multiple dataframes and call writeStream.format("kafka") with the invidiual option("topic" settings on each.
raw_df = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", bootstrap_server)
.option("subscribePattern", "ingestion_src_api.*")
.option("startingOffsets", "latest")
.load()
parsed_df = raw_df.select(col("topic").cast("string"), from_json(col("value").cast("string"), schema).alias("value"))
processed_df = parsed_df
.select(to_json(struct(
expr("value.active_id as active_id"),
expr("value.size as timeframe"),
expr("cast(value.at / 1000000000 as timestamp) as executed_at"),
expr("FROM_UNIXTIME(value.from) as candle_from"),
expr("FROM_UNIXTIME(value.to) as candle_to"),
expr("value.id as period"),
"value.open", "value.close", "value.min", "value.max", "value.ask", "value.bid", "value.volume"
)).alias("value"))
btc_1 = processed_df.filter( ... something to get just this data )
btc_5 = processed_df.filter( ... etc )
btc_1.writeStream.format("kafka")
.option("topic", "processed_src_api_iq_BTCUSD_1_json")
...
btc_5.writeStream.format("kafka")
.option("topic", "processed_src_api_iq_BTCUSD_5_json")
...

Related

What is the best way to perform multiple filter operations on spark streaming dataframe read from Kafka?

I need to apply multiple filters on a DataFrame read from a Kafka topic and publish output of each of these filter to an external system (like another Kafka topic).
I have read the kafkaDF like this
val kafkaDF: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "try.kafka.stream")
.load()
.select(col("topic"), expr("cast(value as string) as message"))
.filter(col("message").isNotNull && col("message") =!= "")
.select(from_json(col("message"), eventsSchema).as("eventData"))
.select("eventData.*")
I am able to run a foreachBatch on this Dataframe and then iterate over the list of filters to get the filtered data which then can be published to a kafka topic, as shown below
kafkaDF.writeStream
.foreachBatch { (batch: DataFrame, _: Long) =>
// List of filters that needs to be applied
filterList.par.foreach(filterString => {
val filteredDF = batch.filter(filterString)
// Add some columns.
// Do some operations based on different filter
filteredDF.toJSON.foreach(value => {
// Publish a message to Kafka
})
})
}
.trigger(Trigger.ProcessingTime("60 seconds"))
.start()
.awaitTermination()
But, I am not sure if this is the best way given so many iterations. Is there a better way than doing it like this?
If you plan to write data from one Kafka topic into multiple Kafka topics you can create a column called "topic" in a single Dataframe when writing to Kafka. The value in this column then defines the topic in which a record will be produced. This allows you to write to as many different Kafka topics as required.
Therefore, I would just apply your filter logic as a when/otherwise condition or, if more complex, as a UDF.
Below is an example code that should get you started. Based on the value of the consumed Kafka message, a column called "topic" gets created in the filteredDf. If value = 1 then the Dataframe record gets produced into the topic called "out1", and otherwise the recod gets produced into topic called "out2".
val inputDf = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "try.kafka.stream")
.option("failOnDataLoss", "false")
.load()
.selectExpr("CAST(key AS STRING) as key", "CAST(value AS STRING) as value", "partition", "offset", "timestamp")
val filteredDf = inputDf.withColumn("topic", when(filter, lit("out1")).otherwise(lit("out2")))
val query = filteredDf
.select(
col("key"),
to_json(struct(col("*"))).alias("value"),
col("topic"))
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "/home/michael/sparkCheckpoint/1/")
.start()
query.awaitTermination()
EDIT: (I might have misunderstood your question initially)
If you just want to find a good way to apply multiple filters out of your filterList you can combine them using foldLeft:
val filter1 = col("value") === 1
val filter2 = col("key") === 1
val filterList = List(filter1, filter2)
val filterAll = filterList.tail.foldLeft(filterList.head)((f1, f2) => f1.and(f2))
println(filterAll)
((value = 1) AND (key = 1))
Then apply .filter(filterAll) to your Dataframe.

Send data to Kafka topics based on a condition in Dataframe

I want to change the Kafka topic destination to save the data depending on the value of the data in SparkStreaming.
Is it possible to do so again?
When I tried the following code, it only executes the first one, but does not execute the lower process.
(testdf
.filter(f.col("value") == "A")
.selectExpr("CAST(value as STRING) as value")
.writeStream
.format("kafka")
.option("checkpointLocation", "/checkpoint_1")
.option("kafka.bootstrap.servers","~~:9092")
.option("topic", "test")
.option("startingOffsets", "latest")
.start()
)
(testdf
.filter(f.col("value") == "B")
.selectExpr("CAST(value as STRING) as value")
.writeStream
.format("kafka")
.option("checkpointLocation", "/checkpoint_2")
.option("kafka.bootstrap.servers","~~:9092")
.option("topic", "testB")
.option("startingOffsets", "latest")
.start()
)
Data is stored in the topic name test.
Can anyone think of a way to do this?
I changed the destination to save such a data frame.
|type|value|
| A |testvalue|
| B |testvalue|
type A to topic test.
type B to topic testB.
With the latest versions of Spark, you could just create a column topic in your dataframe which is used to direct the record into the corresponding topic.
In your case it would mean you can do something like
testdf
.withColumn("topic", when(f.col("value") == "A", lit("test")).otherwise(lit("testB"))
.selectExpr("CAST(value as STRING) as value", "topic")
.writeStream .format("kafka")
.option("checkpointLocation", "/checkpoint_1")
.option("kafka.bootstrap.servers","~~:9092")
.start()
thx mike.
I was able to achieve this by running the following code!
(
testdf
.withColumn("topic",f.when(f.col("testTime") == "A", f.lit("test")).otherwise(("testB")))
.selectExpr("CAST(value as STRING) as value", "topic")
.writeStream
.format("kafka")
.option("checkpointLocation", "/checkpoint_2")
.option("startingOffsets", "latest")
.option("kafka.bootstrap.servers","9092")
.start()
)

Spark Structured Streaming - AssertionError in Checkpoint due to increasing the number of input sources

I am trying to join two streams into one and write the result to a topic
code:
1- Reading two topics
val PERSONINFORMATION_df: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "xx:9092")
.option("subscribe", "PERSONINFORMATION")
.option("group.id", "info")
.option("maxOffsetsPerTrigger", 1000)
.option("startingOffsets", "earliest")
.load()
val CANDIDATEINFORMATION_df: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "xxx:9092")
.option("subscribe", "CANDIDATEINFORMATION")
.option("group.id", "candent")
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", 1000)
.option("failOnDataLoss", "false")
.load()
2- Parse data to join them:
val parsed_PERSONINFORMATION_df: DataFrame = PERSONINFORMATION_df
.select(from_json(expr("cast(value as string) as actualValue"), schemaPERSONINFORMATION).as("s")).select("s.*")
val parsed_CANDIDATEINFORMATION_df: DataFrame = CANDIDATEINFORMATION_df
.select(from_json(expr("cast(value as string) as actualValue"), schemaCANDIDATEINFORMATION).as("s")).select("s.*")
val df_person = parsed_PERSONINFORMATION_df.as("dfperson")
val df_candidate = parsed_CANDIDATEINFORMATION_df.as("dfcandidate")
3- Join two frames
val joined_df : DataFrame = df_candidate.join(df_person, col("dfcandidate.PERSONID") === col("dfperson.ID"),"inner")
val string2json: DataFrame = joined_df.select($"dfcandidate.ID".as("key"),to_json(struct($"dfcandidate.ID", $"FULLNAME", $"PERSONALID")).cast("String").as("value"))
4- Write them to a topic
string2json.writeStream.format("kafka")
.option("kafka.bootstrap.servers", xxxx:9092")
.option("topic", "toDelete")
.option("checkpointLocation", "checkpoints")
.option("failOnDataLoss", "false")
.start()
.awaitTermination()
Error message:
21/01/25 11:01:41 ERROR streaming.MicroBatchExecution: Query [id = 9ce8bcf2-0299-42d5-9b5e-534af8d689e3, runId = 0c0919c6-f49e-48ae-a635-2e95e31fdd50] terminated with error
java.lang.AssertionError: assertion failed: There are [1] sources in the checkpoint offsets and now there are [2] sources requested by the query. Cannot continue.
Your code looks fine to me, it is rather the checkpointing that is causing the issue.
Based on the error message you are getting you probably ran this job with only one stream source. Then, you added the code for the stream join and tried to re-start the application without remiving existing checkpoint files. Now, the application tries to recover from the checkpoint files but realises that you initially had only one source and now you have two sources.
The section Recovery Semantics after Changes in a Streaming Query explains which changes are allowed and not allowed when using checkpointing. Changing the number of input sources is not allowed:
"Changes in the number or type (i.e. different source) of input sources: This is not allowed."
To solve your problem: Delete the current checkpoint files and re-start the job.

Read Kafka topic tail in Spark

I need to subscribe to Kafka topic latest offset, read some newest records, print them and finish. How can I do this in Spark? I suppose I could do something like this
sqlContext
.read
.format("kafka")
.option("kafka.bootstrap.servers", "192.168.1.1:9092,...")
.option("subscribe", "myTopic")
.option("startingOffsets", "latest")
.filter($"someField" === "someValue")
.take(10)
.show
You need to be aware in advance until which offsets in which partitions you want to consume from Kafka. If you have that information, you can do something like:
// Subscribe to multiple topics, specifying explicit Kafka offsets
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "192.168.1.1:9092,...")
.option("subscribe", "myTopic")
.option("startingOffsets", """{"myTopic":{"0":20,"1":20}}""")
.option("endingOffsets", """{"myTopic":{"0":25,"1":25}}""")
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
.filter(...)
More details on the startingOffsets and endingOffsets are given in the Kafka + Spark Integration Guide

read a data from kafka topic and aggregate using spark tempview?

i want a read a data from kafka topic, and create spark tempview to group by some columns?
+----+--------------------+
| key| value|
+----+--------------------+
|null|{"e":"trade","E":...|
|null|{"e":"trade","E":...|
|null|{"e":"trade","E":...|
but i can't able to aggregate data from tempview?? value column data stored as a String???
Dataset<Row> data = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092,localhost:9093")
.option("subscribe", "data2-topic")
.option("startingOffsets", "latest")
.option ("group.id", "test")
.option("enable.auto.commit", "true")
.option("auto.commit.interval.ms", "1000")
.load();
data.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");
data.createOrReplaceTempView("Tempdata");
data.show();
Dataset<Row> df2=spark.sql("SELECT e FROM Tempdata group by e");
df2.show();
value column data stored as a String???
Yes.. Because you CAST(value as STRING)
You'll want to use a from_json function that'll load the row into a proper dataframe that you can search within.
See Databrick's blog on Structured Streaming on Kafka for some examples
If the primary goal is just grouping of some fields, then KSQL might be an alternative.

Resources