Spark Structured Streaming rate limit - apache-spark

I am Trying to control records per triggers in structured streaming. Is their any function for it. I tried different properties but nothing seems to be working.
import org.apache.spark.sql.streaming.Trigger
val checkpointPath = "/user/akash-singh.bisht#unilever.com/dbacademy/developer-foundations-capstone/checkpoint/orders"
// val outputPath = "/user/akash-singh.bisht#unilever.com/dbacademy/developer-foundations-capstone/raw/orders/stream"
val devicesQuery = df.writeStream
.outputMode("append")
.format("delta")
.queryName("orders")
.trigger(Trigger.ProcessingTime("1 second"))
.option("inputRowsPerSecond", 1)
.option("maxFilesPerTrigger", 1)
// .option("checkpointLocation", checkpointPath)
// .start(orders_checkpoint_path)
.option("checkpointLocation",checkpointPath)
.table("orders")

Delta uses two options maxFilesPerTrigger & maxBytesPerTrigger. You already use the first one, and it takes over the precedence over the second. The real number of records processed per trigger depends on the size of the input files and number of records inside it, as Delta processes complete files, not splitting it into multiple chunks.
But these options needs to be specified on the source Delta table, not on the sink, as you specify right now:
spark.readStream.format("delta")
.option("maxFilesPerTrigger", "1")
.load("/delta/events")
.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "...")
.table("orders")
Update, just to show that option works.
Generate test data in directory /Users/user/tmp/abc/:
for i in {1..100}; do echo "{\"id\":$i}" > $i.json; done
then run the test, but use foreachBatch to map what file was processed in which trigger/batch:
import pyspark.sql.functions as F
df = spark.readStream.format("json").schema("id int") \
.option("maxFilesPerTrigger", "1").load("/Users/user/tmp/abc/")
df2 = df.withColumn("file", F.input_file_name())
def feb(d, e):
d.withColumn("batch", F.lit(e)).write.format("parquet") \
.mode("append").save("2.parquet")
stream = df2.writeStream.outputMode("append").foreachBatch(feb).start()
# wait a minute or so
stream.stop()
bdf = spark.read.parquet("2.parquet")
# check content
>>> bdf.show(5, truncate=False)
+---+----------------------------------+-----+
|id |file |batch|
+---+----------------------------------+-----+
|100|file:///Users/user/tmp/abc/100.json|94 |
|99 |file:///Users/user/tmp/abc/99.json |19 |
|78 |file:///Users/user/tmp/abc/78.json |87 |
|81 |file:///Users/user/tmp/abc/81.json |89 |
|34 |file:///Users/user/tmp/abc/34.json |69 |
+---+----------------------------------+-----+
# check that each file came in a separate batch
>>> bdf.select("batch").dropDuplicates().count()
100
If I increase maxFilesPerTrigger to 2, then I'll get 50 batches, etc.

Related

Change filter/where condition when restarting a Structured Streaming query reading data from Delta Table

In Structured Streaming, will the checkpoints keep track of which data has already been processed from a Delta Table?
def fetch_data_streaming(source_table: str):
print("Fetching now")
streamingInputDF = (
spark
.readStream
.format("delta")
.option("maxBytesPerTrigger",1024)
.table(source_table)
.where("measurementId IN (1351,1350)")
.where("year >= '2021'")
)
query = (
streamingInputDF
.writeStream
.outputMode("append")
.option("checkpointLocation", "/streaming_checkpoints/5")
.foreachBatch(customWriter)
.start()
.awaitTermination()
)
return query
def customWriter(batchDF,batchId):
print(batchId)
print(batchDF.count())
batchDF.show(10)
length = batchDF.count()
print("batchId,batch size:",batchId,length)
If I change the where clause in the streamingInputDF to add more measurentId, the structured streaming job doesn't always acknowledge the change and fetch the new data values. It continues to run as if nothing has changed, whereas at times it starts fetching new values.
Isn't the checkpoint supposed to identify the change?
Edit: Schema of delta table:
col_name
data_type
measurementId
int
year
int
time
timestamp
q
smallint
v
string
"In structured streaming, will the checkpoints will keep track of which data has already been processed?"
Yes, the Structured Streaming job will store the read version of the Delta table in its checkpoint files to avoid producing duplicates.
Within the checkpoint directory in the folder "offsets", you will see that Spark stored the progress per batchId. For example it will look like below:
v1
{"batchWatermarkMs":0,"batchTimestampMs":1619695775288,"conf":[...]}
{"sourceVersion":1,"reservoirId":"d910a260-6aa2-4a7c-9f5c-1be3164127c0","reservoirVersion":2,"index":2,"isStartingVersion":true}
Here, the important part is the "reservoirVersion":2 which tells you that the streaming job has consumed all data from the Delta Table as of version 2.
Re-starting your Structured Streaming query with an additional filter condition will therefore not be applied to historic records but only to those that were added to the Delta Table after version 2.
In order to see this behavior in action you can use below code and analyse the content in the checkpoint files.
val deltaPath = "file:///tmp/delta/table"
val checkpointLocation = "file:///tmp/checkpoint/"
// run the following two lines once
val deltaDf = Seq(("1", "foo1"), ("2", "foo2"), ("3", "foo2")).toDF("id", "value")
deltaDf.write.format("delta").mode("append").save(deltaPath)
// run this code for the first time, then add filter condition, then run again
val query = spark.readStream
.format("delta")
.load(deltaPath)
.filter(col("id").isin("1")) // in the second run add "2"
.writeStream
.format("console")
.outputMode("append")
.option("checkpointLocation", checkpointLocation)
.start()
query.awaitTermination()
Now, if you append some more data to the Delta table while the streaming query is shut down and then restart is with the new filter condition it will be applied to the new data.

Spark Structural Streaming Output Mode problem

I am currently looking for an applicable solution to solve the following problem using Spark Structured Streaming API. I have searched through a lot of blog posts and Stackoverflow. Unfortunately, I still can't find a solution to this. Hence raising this ticket to call for expert help.
Use Case
Let said I have a Kafka Topic (user_creation_log) that has all the real-time user_creation_event. For those users who didn't do any transaction within 10 secs, 20 secs, and 30 secs then we will assign them a certain voucher. ( time windows is shortened for testing purpose)
Flag and sending the timeout row (more than 10 sec, more than 20 sec , more than 30 secs) to Kafka is the most problematic part!!! Too much rules, or perhaps i should break it 10sec,20sec and 30 secs into different script
My Tracking Table
I am able to track user no_action_sec by no_action_10sec,no_action_20sec,no_action_30sec flag(shown in code below). The no_action_sec is derived from (current_time - creation_time) which will be calculated in every microbatch.
Complete Output Mode
outputMode("complete") writes all the rows of a Result Table (and corresponds to a traditional batch structured query).
Update Output Mode
outputMode("update") writes only the rows that were updated (every time there are updates).
In this case Update Output Mode seems very suitable because it will write an updated row to output. However, whenever the flag10, flag20, flag30 columns have been updated, the row didn't write to the desired location.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SparkSession \
.builder \
.appName("Notification") \
.getOrCreate()
lines = spark \
.readStream \
.format("socket") \
.option("host", "localhost") \
.option("port", 9999) \
.load()
split_col=split(lines.value, ' ')
df = lines.withColumn('user_id', split_col.getItem(0))
df = df.withColumn('create_date_time', split_col.getItem(1)) \
.groupBy("user_id","create_date_time").count()
df = df.withColumn("create_date_time",col("create_date_time").cast(LongType())) \
.withColumn("no_action_sec", current_timestamp().cast(LongType()) -col("create_date_time").cast(LongType()) ) \
.withColumn("no_action_10sec", when(col("no_action_sec") >= 10 ,True)) \
.withColumn("no_action_20sec", when(col("no_action_sec") >= 20 ,True)) \
.withColumn("no_action_30sec", when(col("no_action_sec") >= 30 ,True)) \
query = df \
.writeStream \
.outputMode("update") \
.format("console") \
.start()
query.awaitTermination()
Current Output
UserId = 0 is disappear in Batch 2. It's supposed to show up because no_action_30sec will changes from null to True.
Expected output
User Id should be write to output 3 times once it triggers the flag logic 10 sec, 20 sec and 30 sec
Can anyone shed light on this problem? Like what can i do to let rows write into output when no_action_10sec,no_action_20sec,no_action_30sec is flag to True?
Debug
OutputMode = Complete will output too much redundant data
Mock Data Generator
for i in {0..10000}; do echo "${i} $(date +%s)"; sleep 1; done | nc -lk 9999
Assume that the row has been showing up in console mode (.format("console") ) will send to Kafka for chaining action

Is is possible to parse JSON string from Kafka topic in real time using Spark Streaming SQL?

I have a Pyspark notebook that connects to kafka broker and creates a spark writeStream called temp. The data values in Kafka topic are in json format but I'm not sure how to go about creating a spark sql table that can parse this data in real time. The only way I know is to create a copy of the table convert it into RDD or DF and parse the value into another RDD and DF. Is is possible to have this done in real time processing as the stream is being written?
Code:
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers","localhost:9092") \
.option("subscribe","hoteth") \
.option("startingOffsets", "earliest") \
.load()
ds = df.selectExpr("CAST (key AS STRING)", "CAST(value AS STRING)", "timestamp")
ds.writeStream.queryName("temp").format("memory").start()
spark.sql("select * from temp limit 5").show()
Output:
+----+--------------------+--------------------+
| key| value| timestamp|
+----+--------------------+--------------------+
|null|{"e":"trade","E":...|2018-09-18 15:41:...|
|null|{"e":"trade","E":...|2018-09-18 15:41:...|
|null|{"e":"trade","E":...|2018-09-18 15:41:...|
|null|{"e":"trade","E":...|2018-09-18 15:41:...|
|null|{"e":"trade","E":...|2018-09-18 15:41:...|
+----+--------------------+--------------------+
One way I could solve this is to just lateral view json_tuple just like it is done in Hive HQL. I'm still looking for a solution that it can parse data directly from the stream so that it doesn't take extra processing time parsing using query.
spark.sql("""
select value, v1.transaction,ticker,price
from temp
lateral view json_tuple(value,"e","s","p") v1 as transaction, ticker,price
limit 5
""").show()

How to write parquet files from streaming query?

I'm reading from a CSV file using Spark 2.2 structured streaming.
My query for writing the result to the console is this:
val consoleQuery = exceptions
.withWatermark("time", "5 years")
.groupBy(window($"time", "1 hour"), $"id")
.count()
.writeStream
.format("console")
.option("truncate", value = false)
.trigger(Trigger.ProcessingTime(10.seconds))
.outputMode(OutputMode.Complete())
The result looks fine:
+---------------------------------------------+-------------+-----+
|window |id |count|
+---------------------------------------------+-------------+-----+
|[2017-02-17 09:00:00.0,2017-02-17 10:00:00.0]|EXC0000000001|1 |
|[2017-02-17 09:00:00.0,2017-02-17 10:00:00.0]|EXC0000000002|8 |
|[2017-02-17 08:00:00.0,2017-02-17 09:00:00.0]|EXC2200002 |1 |
+---------------------------------------------+-------------+-----+
But when writing it to a Parquet file
val parquetQuery = exceptions
.withWatermark("time", "5 years")
.groupBy(window($"time", "1 hour"), $"id")
.count()
.coalesce(1)
.writeStream
.format("parquet")
.option("path", "src/main/resources/parquet")
.trigger(Trigger.ProcessingTime(10.seconds))
.option("checkpointLocation", "src/main/resources/checkpoint")
.outputMode(OutputMode.Append())
and reading it in with another job,
val data = spark.read.parquet("src/main/resources/parquet/")
the result is this:
+------+---+-----+
|window|id |count|
+------+---+-----+
+------+---+-----+
TL;DR parquetQuery has not been started and so no output from a streaming query.
Check out the type of parquetQuery which is org.apache.spark.sql.streaming.DataStreamWriter which is simply a description of a query that at some point is supposed to be started. Since it was not, the query has never been able to do anything that would write the stream.
Add start at the very end of parquetQuery declaration (right after or as part of the call chain).
val parquetQuery = exceptions
.withWatermark("time", "5 years")
.groupBy(window($"time", "1 hour"), $"id")
.count()
.coalesce(1)
.writeStream
.format("parquet")
.option("path", "src/main/resources/parquet")
.trigger(Trigger.ProcessingTime(10.seconds))
.option("checkpointLocation", "src/main/resources/checkpoint")
.outputMode(OutputMode.Append())
.start // <-- that's what you miss

Executing separate streaming queries in spark structured streaming

I am trying to aggregate stream with two different windows and printing it into the console. However only the first streaming query is being printed. The tenSecsQ is not printed into the console.
SparkSession spark = SparkSession
.builder()
.appName("JavaStructuredNetworkWordCountWindowed")
.config("spark.master", "local[*]")
.getOrCreate();
Dataset<Row> lines = spark
.readStream()
.format("socket")
.option("host", host)
.option("port", port)
.option("includeTimestamp", true)
.load();
Dataset<Row> words = lines
.as(Encoders.tuple(Encoders.STRING(), Encoders.TIMESTAMP()))
.toDF("word", "timestamp");
// 5 second window
Dataset<Row> fiveSecs = words
.groupBy(
functions.window(words.col("timestamp"), "5 seconds"),
words.col("word")
).count().orderBy("window");
// 10 second window
Dataset<Row> tenSecs = words
.groupBy(
functions.window(words.col("timestamp"), "10 seconds"),
words.col("word")
).count().orderBy("window");
Trigger Streaming Query for both 5s and 10s aggregated streams. The output for 10s stream is not printed. Only 5s is printed into console
// Start writeStream() for 5s window
StreamingQuery fiveSecQ = fiveSecs.writeStream()
.queryName("5_secs")
.outputMode("complete")
.format("console")
.option("truncate", "false")
.start();
// Start writeStream() for 10s window
StreamingQuery tenSecsQ = tenSecs.writeStream()
.queryName("10_secs")
.outputMode("complete")
.format("console")
.option("truncate", "false")
.start();
tenSecsQ.awaitTermination();
I've been investigating this question.
Summary: Each query in Structured Streaming consumes the source data. The socket source creates a new connection for each query defined. The behavior seen in this case is because nc is only delivering the input data to the first connection.
Henceforth, it's not possible to define multiple aggregations over the socket connection unless we can ensure that the connected socket source delivers the same data to each connection open.
I discussed this question on the Spark mailing list.
Databricks developer Shixiong Zhu answered:
Spark creates one connection for each query. The behavior you observed is because how "nc -lk" works. If you use netstat to check the tcp connections, you will see there are two connections when starting two queries. However, "nc" forwards the input to only one connection.
I verified this behavior by defining a small experiment:
First, I created a SimpleTCPWordServer that delivers random words to each connection open and a basic Structured Streaming job that declares two queries. The only difference between them is that the 2nd query defines an extra constant column to differentiate its output:
val lines = spark
.readStream
.format("socket")
.option("host", "localhost")
.option("port", "9999")
.option("includeTimestamp", true)
.load()
val q1 = lines.writeStream
.outputMode("append")
.format("console")
.trigger(Trigger.ProcessingTime("5 seconds"))
.start()
val q2 = lines.withColumn("foo", lit("foo")).writeStream
.outputMode("append")
.format("console")
.trigger(Trigger.ProcessingTime("7 seconds"))
.start()
If StructuredStreaming would consume only one stream, then we should see the same words delivered by both queries. In the case that each query consumes a separate stream, then we will have different words reported by each query.
This is the observed output:
-------------------------------------------
Batch: 0
-------------------------------------------
+--------+-------------------+
| value| timestamp|
+--------+-------------------+
|champion|2017-08-14 13:54:51|
+--------+-------------------+
+------+-------------------+---+
| value| timestamp|foo|
+------+-------------------+---+
|belong|2017-08-14 13:54:51|foo|
+------+-------------------+---+
-------------------------------------------
Batch: 1
-------------------------------------------
+-------+-------------------+---+
| value| timestamp|foo|
+-------+-------------------+---+
| agenda|2017-08-14 13:54:52|foo|
|ceiling|2017-08-14 13:54:52|foo|
| bear|2017-08-14 13:54:53|foo|
+-------+-------------------+---+
-------------------------------------------
Batch: 1
-------------------------------------------
+----------+-------------------+
| value| timestamp|
+----------+-------------------+
| breath|2017-08-14 13:54:52|
|anticipate|2017-08-14 13:54:52|
| amazing|2017-08-14 13:54:52|
| bottle|2017-08-14 13:54:53|
| calculate|2017-08-14 13:54:53|
| asset|2017-08-14 13:54:54|
| cell|2017-08-14 13:54:54|
+----------+-------------------+
We can clearly see that the streams for each query are different. It would look like it's not possible to define multiple aggregations over the data delivered by the socket source unless we can guarantee that the TCP backend server delivers exactly the same data to each open connection.

Resources