Read data from a Kafka topic and aggregate using a Spark temp view? - apache-spark

I want to read data from a Kafka topic and create a Spark temp view to group by some columns.
+----+--------------------+
| key| value|
+----+--------------------+
|null|{"e":"trade","E":...|
|null|{"e":"trade","E":...|
|null|{"e":"trade","E":...|
But I'm not able to aggregate data from the temp view. Is the value column data stored as a String?
Dataset<Row> data = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092,localhost:9093")
    .option("subscribe", "data2-topic")
    .option("startingOffsets", "latest")
    .option("group.id", "test")
    .option("enable.auto.commit", "true")
    .option("auto.commit.interval.ms", "1000")
    .load();

data.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");
data.createOrReplaceTempView("Tempdata");
data.show();

Dataset<Row> df2 = spark.sql("SELECT e FROM Tempdata group by e");
df2.show();

Is the value column data stored as a String?
Yes, because you CAST(value AS STRING).
You'll want to use the from_json function, which parses the value into a proper DataFrame that you can query.
See Databricks' blog on Structured Streaming with Kafka for some examples.
If the primary goal is just grouping of some fields, then KSQL might be an alternative.
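For illustration, a minimal PySpark sketch of the from_json approach (the Java API is analogous); the single-field schema, the console sink, and an existing SparkSession named spark are assumptions based only on the sample rows above:

# Sketch: parse the Kafka value as JSON, then aggregate on the parsed column.
# Only the "e" field visible in the sample is modelled; extend the schema as needed.
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("e", StringType())])

parsed = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092,localhost:9093")
    .option("subscribe", "data2-topic")
    .option("startingOffsets", "latest")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*"))

# Streaming aggregations need the update or complete output mode (no watermark here).
counts = parsed.groupBy("e").count()

(counts.writeStream
    .outputMode("complete")
    .format("console")
    .start())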

Related

Spark Streaming subscribe multiple topics and write into multiple topics

I have some Kafka topics with the naming convention below:
'ingestion_src_api_iq_BTCUSD_1_json', 'ingestion_src_api_iq_BTCUSD_5_json', 'ingestion_src_api_iq_BTCUSD_60_json'
I'm reading all these topics, which have the same data structure, using the "subscribePattern" option in Spark.
(spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_server)
    .option("subscribePattern", "ingestion_src_api.*")
    .option("startingOffsets", "latest")
    .load()
    .select(col("topic").cast("string"),
            from_json(col("value").cast("string"), schema).alias("value"))
    .select(to_json(struct(
        expr("value.active_id as active_id"),
        expr("value.size as timeframe"),
        expr("cast(value.at / 1000000000 as timestamp) as executed_at"),
        expr("FROM_UNIXTIME(value.from) as candle_from"),
        expr("FROM_UNIXTIME(value.to) as candle_to"),
        expr("value.id as period"),
        "value.open", "value.close", "value.min", "value.max", "value.ask", "value.bid", "value.volume"
    )).alias("value"))
    .writeStream.format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_server)
    .option("topic", "processed_src_api_iq_data")
    .option("checkpointLocation", f"./checkpoint/")
    .start()
)
How could I write the transformed data into different topics, like:
'processed_src_api_iq_BTCUSD_1_json', 'processed_src_api_iq_BTCUSD_5_json', 'processed_src_api_iq_BTCUSD_60_json'
In my code I am able to write to only one topic, "processed_src_api_iq_data".
The outgoing dataframe written with format("kafka") can include a string column named topic, which determines the topic each record's value and/or key columns are produced to, rather than setting it via option, as documented:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka
The topic column is required if the “topic” configuration option is not specified
Use withColumn to add the necessary values, based on the other columns that you have (a short sketch of this approach follows the example below).
Alternatively, create multiple dataframes and call writeStream.format("kafka") with an individual option("topic", ...) setting on each:
raw_df = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_server)
    .option("subscribePattern", "ingestion_src_api.*")
    .option("startingOffsets", "latest")
    .load())

parsed_df = raw_df.select(
    col("topic").cast("string"),
    from_json(col("value").cast("string"), schema).alias("value"))

processed_df = parsed_df.select(to_json(struct(
    expr("value.active_id as active_id"),
    expr("value.size as timeframe"),
    expr("cast(value.at / 1000000000 as timestamp) as executed_at"),
    expr("FROM_UNIXTIME(value.from) as candle_from"),
    expr("FROM_UNIXTIME(value.to) as candle_to"),
    expr("value.id as period"),
    "value.open", "value.close", "value.min", "value.max", "value.ask", "value.bid", "value.volume"
)).alias("value"))

btc_1 = processed_df.filter( ... something to get just this data )
btc_5 = processed_df.filter( ... etc )

btc_1.writeStream.format("kafka")
    .option("topic", "processed_src_api_iq_BTCUSD_1_json")
    ...

btc_5.writeStream.format("kafka")
    .option("topic", "processed_src_api_iq_BTCUSD_5_json")
    ...
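For the withColumn-style alternative (a single writeStream, no option("topic", ...)), here is a minimal sketch that reuses parsed_df from above. It assumes the output topic names are simply the input names with the ingestion prefix swapped for processed, and the checkpoint path is just a placeholder:

# Sketch only: route records by rewriting the source topic name; to_json(value)
# can be swapped for the struct(...) projection used in processed_df above.
from pyspark.sql.functions import col, regexp_replace, to_json

routed_df = parsed_df.select(
    regexp_replace(col("topic"), "^ingestion", "processed").alias("topic"),
    to_json(col("value")).alias("value"))

(routed_df.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_server)
    .option("checkpointLocation", "./checkpoint_routed/")
    .start())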

How to calculate moving average in spark structured streaming?

I am trying to calculate a moving average in Spark Structured Streaming based on the preceding rows, not on event time.
Kafka has string messages like this:
device1#227.92#2021-08-19T12:15:13.540Z
and there is this code
Dataset<Row> lines = sparkSession.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "users")
    .load()
    .selectExpr("CAST(value AS STRING)")
    .map((MapFunction<Row, Row>) row -> {
        String message = row.getAs("value");
        String[] newRow = message.split("#");
        return RowFactory.create(newRow);
    }, RowEncoder.apply(structType))
    .selectExpr("CAST(item AS STRING)", "CAST(value AS DOUBLE)", "CAST(timestamp AS TIMESTAMP)");
The above code reads the stream from Kafka and transforms the string messages into rows.
When I try to do something like this:
WindowSpec threeRowWindow = Window.partitionBy("item").orderBy("timestamp").rowsBetween(Window.currentRow(), -3);
Dataset<Row> testWindow =
lines.withColumn("avg", functions.avg("value").over(threeRowWindow));
I get this error:
org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;
Is there any other way to calculate the moving average as each message arrives, updating it as new data comes from the stream? Or are non-time-based operations simply not supported in Spark Structured Streaming?
Thanks

What is the best way to perform multiple filter operations on spark streaming dataframe read from Kafka?

I need to apply multiple filters on a DataFrame read from a Kafka topic and publish the output of each of these filters to an external system (like another Kafka topic).
I have read the kafkaDF like this:
val kafkaDF: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "try.kafka.stream")
  .load()
  .select(col("topic"), expr("cast(value as string) as message"))
  .filter(col("message").isNotNull && col("message") =!= "")
  .select(from_json(col("message"), eventsSchema).as("eventData"))
  .select("eventData.*")
I am able to run a foreachBatch on this Dataframe and then iterate over the list of filters to get the filtered data, which can then be published to a Kafka topic, as shown below:
kafkaDF.writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    // List of filters that needs to be applied
    filterList.par.foreach(filterString => {
      val filteredDF = batch.filter(filterString)
      // Add some columns.
      // Do some operations based on different filter
      filteredDF.toJSON.foreach(value => {
        // Publish a message to Kafka
      })
    })
  }
  .trigger(Trigger.ProcessingTime("60 seconds"))
  .start()
  .awaitTermination()
But I am not sure whether this is the best way, given so many iterations. Is there a better way of doing it?
If you plan to write data from one Kafka topic into multiple Kafka topics, you can create a column called "topic" in a single Dataframe when writing to Kafka. The value in this column then defines the topic in which a record will be produced. This allows you to write to as many different Kafka topics as required.
Therefore, I would just apply your filter logic as a when/otherwise condition or, if it is more complex, as a UDF.
Below is example code that should get you started. Based on the value of the consumed Kafka message, a column called "topic" gets created in the filteredDf. If value = 1, the record gets produced into the topic called "out1"; otherwise the record gets produced into the topic called "out2".
val inputDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "try.kafka.stream")
  .option("failOnDataLoss", "false")
  .load()
  .selectExpr("CAST(key AS STRING) as key", "CAST(value AS STRING) as value", "partition", "offset", "timestamp")

val filteredDf = inputDf.withColumn("topic", when(filter, lit("out1")).otherwise(lit("out2")))

val query = filteredDf
  .select(
    col("key"),
    to_json(struct(col("*"))).alias("value"),
    col("topic"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "/home/michael/sparkCheckpoint/1/")
  .start()

query.awaitTermination()
EDIT: (I might have misunderstood your question initially)
If you just want to find a good way to apply multiple filters out of your filterList you can combine them using foldLeft:
val filter1 = col("value") === 1
val filter2 = col("key") === 1
val filterList = List(filter1, filter2)
val filterAll = filterList.tail.foldLeft(filterList.head)((f1, f2) => f1.and(f2))
println(filterAll)
((value = 1) AND (key = 1))
Then apply .filter(filterAll) to your Dataframe.

Spark Streaming: Text data source supports only a single column

I am consuming Kafka data and then streaming the data to HDFS.
The data stored in the Kafka topic trial looks like:
hadoop
hive
hive
kafka
hive
However, when I submit my code, it returns:
Exception in thread "main"
org.apache.spark.sql.streaming.StreamingQueryException: Text data source supports only a single column, and you have 7 columns.;
=== Streaming Query ===
Identifier: [id = 2f3c7433-f511-49e6-bdcf-4275b1f1229a, runId = 9c0f7a35-118a-469c-990f-af00f55d95fb]
Current Committed Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":13}}}
Current Available Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":14}}}
My question is: as shown above, the data stored in Kafka comprises only ONE column, so why does the program say there are 7 columns?
Any help is appreciated.
My Spark Streaming code:
def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder.master("local[4]")
    .appName("SpeedTester")
    .config("spark.driver.memory", "3g")
    .getOrCreate()

  val ds = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "192.168.95.20:9092")
    .option("subscribe", "trial")
    .option("startingOffsets", "earliest")
    .load()
    .writeStream
    .format("text")
    .option("path", "hdfs://192.168.95.21:8022/tmp/streaming/fixed")
    .option("checkpointLocation", "/tmp/checkpoint")
    .start()
    .awaitTermination()
}
That is explained in the Structured Streaming + Kafka Integration Guide:
Each row in the source has the following schema:
Column         Type
key            binary
value          binary
topic          string
partition      int
offset         long
timestamp      long
timestampType  int
That gives exactly seven columns. If you want to write only the payload (value), select it and cast it to string:
spark.readStream
...
.load()
.selectExpr("CAST(value as string)")
.writeStream
...
.awaitTermination()

Is it possible to parse a JSON string from a Kafka topic in real time using Spark Streaming SQL?

I have a PySpark notebook that connects to a Kafka broker and creates a Spark writeStream called temp. The data values in the Kafka topic are in JSON format, but I'm not sure how to go about creating a Spark SQL table that can parse this data in real time. The only way I know is to create a copy of the table, convert it into an RDD or DF, and parse the value into another RDD and DF. Is it possible to have this done in real time as the stream is being written?
Code:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "hoteth") \
    .option("startingOffsets", "earliest") \
    .load()

ds = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
ds.writeStream.queryName("temp").format("memory").start()
spark.sql("select * from temp limit 5").show()
Output:
+----+--------------------+--------------------+
| key| value| timestamp|
+----+--------------------+--------------------+
|null|{"e":"trade","E":...|2018-09-18 15:41:...|
|null|{"e":"trade","E":...|2018-09-18 15:41:...|
|null|{"e":"trade","E":...|2018-09-18 15:41:...|
|null|{"e":"trade","E":...|2018-09-18 15:41:...|
|null|{"e":"trade","E":...|2018-09-18 15:41:...|
+----+--------------------+--------------------+
One way I could solve this is to just use lateral view json_tuple, as is done in Hive HQL. I'm still looking for a solution that parses the data directly from the stream, so that it doesn't take extra processing time parsing with a query.
spark.sql("""
    select value, v1.transaction, ticker, price
    from temp
    lateral view json_tuple(value, "e", "s", "p") v1 as transaction, ticker, price
    limit 5
""").show()
