Spark Structured Streaming Output Mode problem - apache-spark

I am looking for a way to solve the following problem with the Spark Structured Streaming API. I have searched many blog posts and Stack Overflow without finding a solution, so I am asking here for expert help.
Use Case
Let's say I have a Kafka topic (user_creation_log) that receives every user_creation_event in real time. Users who make no transaction within 10, 20, and 30 seconds of creation are assigned a voucher. (The time windows are shortened for testing purposes.)
Flagging the timed-out rows (more than 10, 20, or 30 seconds) and sending them to Kafka is the most problematic part. There may be too many rules in one job; perhaps I should split the 10-, 20-, and 30-second checks into separate scripts.
My Tracking Table
I can track each user's no_action_sec with the no_action_10sec, no_action_20sec, and no_action_30sec flags (shown in the code below). no_action_sec is derived from (current_time - creation_time) and is recalculated in every micro-batch.
Complete Output Mode
outputMode("complete") writes all the rows of a Result Table (and corresponds to a traditional batch structured query).
Update Output Mode
outputMode("update") writes only the rows that were updated (every time there are updates).
In this case Update output mode seems very suitable because it writes each updated row to the output. However, when one of the flag columns (no_action_10sec, no_action_20sec, no_action_30sec) changes, the row is not written to the output.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession \
    .builder \
    .appName("Notification") \
    .getOrCreate()

# Read "<user_id> <epoch_seconds>" lines from a local socket
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

split_col = split(lines.value, ' ')
df = lines.withColumn('user_id', split_col.getItem(0))
df = df.withColumn('create_date_time', split_col.getItem(1)) \
    .groupBy("user_id", "create_date_time").count()

# Seconds since creation, recomputed in every micro-batch, plus one flag per threshold
df = df.withColumn("create_date_time", col("create_date_time").cast(LongType())) \
    .withColumn("no_action_sec", current_timestamp().cast(LongType()) - col("create_date_time").cast(LongType())) \
    .withColumn("no_action_10sec", when(col("no_action_sec") >= 10, True)) \
    .withColumn("no_action_20sec", when(col("no_action_sec") >= 20, True)) \
    .withColumn("no_action_30sec", when(col("no_action_sec") >= 30, True))

query = df \
    .writeStream \
    .outputMode("update") \
    .format("console") \
    .start()

query.awaitTermination()
Current Output
User ID 0 disappears in Batch 2. It should show up because no_action_30sec changes from null to True.
Expected output
Each user ID should be written to the output three times, once for each flag it triggers (10, 20, and 30 seconds).
Can anyone shed light on this problem? What can I do so that a row is written to the output whenever no_action_10sec, no_action_20sec, or no_action_30sec is set to True?
Debug
outputMode("complete") outputs too much redundant data.
Mock Data Generator
for i in {0..10000}; do echo "${i} $(date +%s)"; sleep 1; done | nc -lk 9999
Assume that any row that shows up in the console sink (.format("console")) will be sent to Kafka to trigger the follow-up action.
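One way to read the expected output is to treat each flag as a threshold that is crossed exactly once: a row should be emitted in a batch only when the elapsed time passes 10, 20, or 30 seconds since the previous check. The sketch below is plain Python, not Spark code; `newly_crossed` and `simulate` are hypothetical names, and it only illustrates the per-user comparison that some form of stateful logic in the stream would have to perform.

```python
THRESHOLDS = (10, 20, 30)  # seconds without a transaction

def newly_crossed(prev_elapsed, curr_elapsed, thresholds=THRESHOLDS):
    """Return the thresholds passed between two consecutive checks."""
    return [t for t in thresholds if prev_elapsed < t <= curr_elapsed]

def simulate(create_time, check_times):
    """Yield (check_time, thresholds) for every check that should emit a row."""
    prev = -1  # below every threshold, so the first check compares cleanly
    for now in check_times:
        elapsed = now - create_time
        crossed = newly_crossed(prev, elapsed)
        if crossed:
            yield now, crossed
        prev = elapsed

if __name__ == "__main__":
    # A user created at t=100, checked every 4 seconds (one check per micro-batch)
    for now, crossed in simulate(create_time=100, check_times=range(100, 140, 4)):
        print(now, crossed)
```

With these inputs the user is emitted three times, once per threshold, which matches the expected output described above. A check that arrives late can cross more than one threshold at once, which is why the function returns a list.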

Related

Spark Structured Streaming rate limit

I am trying to control the number of records per trigger in Structured Streaming. Is there any function for it? I tried different properties, but nothing seems to work.
import org.apache.spark.sql.streaming.Trigger

val checkpointPath = "/user/akash-singh.bisht#unilever.com/dbacademy/developer-foundations-capstone/checkpoint/orders"
// val outputPath = "/user/akash-singh.bisht#unilever.com/dbacademy/developer-foundations-capstone/raw/orders/stream"
val devicesQuery = df.writeStream
  .outputMode("append")
  .format("delta")
  .queryName("orders")
  .trigger(Trigger.ProcessingTime("1 second"))
  .option("inputRowsPerSecond", 1)
  .option("maxFilesPerTrigger", 1)
  // .option("checkpointLocation", checkpointPath)
  // .start(orders_checkpoint_path)
  .option("checkpointLocation",checkpointPath)
  .table("orders")
Delta uses two options, maxFilesPerTrigger and maxBytesPerTrigger. You already use the first one, and it takes precedence over the second. The actual number of records processed per trigger depends on the size of the input files and the number of records in them, because Delta processes complete files and does not split them into multiple chunks.
But these options need to be specified on the source Delta table, not on the sink, as you do now:
spark.readStream.format("delta")
  .option("maxFilesPerTrigger", "1")
  .load("/delta/events")
  .writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "...")
  .table("orders")
Update, just to show that option works.
Generate test data in directory /Users/user/tmp/abc/:
for i in {1..100}; do echo "{\"id\":$i}" > $i.json; done
then run the test, but use foreachBatch to map what file was processed in which trigger/batch:
import pyspark.sql.functions as F

df = spark.readStream.format("json").schema("id int") \
    .option("maxFilesPerTrigger", "1").load("/Users/user/tmp/abc/")
df2 = df.withColumn("file", F.input_file_name())

def feb(d, e):
    d.withColumn("batch", F.lit(e)).write.format("parquet") \
        .mode("append").save("2.parquet")

stream = df2.writeStream.outputMode("append").foreachBatch(feb).start()
# wait a minute or so
stream.stop()
bdf = spark.read.parquet("2.parquet")
# check content
>>> bdf.show(5, truncate=False)
+---+----------------------------------+-----+
|id |file |batch|
+---+----------------------------------+-----+
|100|file:///Users/user/tmp/abc/100.json|94 |
|99 |file:///Users/user/tmp/abc/99.json |19 |
|78 |file:///Users/user/tmp/abc/78.json |87 |
|81 |file:///Users/user/tmp/abc/81.json |89 |
|34 |file:///Users/user/tmp/abc/34.json |69 |
+---+----------------------------------+-----+
# check that each file came in a separate batch
>>> bdf.select("batch").dropDuplicates().count()
100
If I increase maxFilesPerTrigger to 2, then I'll get 50 batches, etc.
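The relationship between the file count and the batch count can be sketched without Spark. This is a minimal illustration, assuming all files are already present when the stream starts and every batch is filled to the limit; `plan_batches` is a hypothetical helper, and the order in which a real source picks up files may differ.

```python
import math

def plan_batches(files, max_files_per_trigger):
    """Split a file listing into consecutive micro-batches of at most N files."""
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be >= 1")
    return [files[i:i + max_files_per_trigger]
            for i in range(0, len(files), max_files_per_trigger)]

files = [f"{i}.json" for i in range(1, 101)]
for limit in (1, 2, 3):
    batches = plan_batches(files, limit)
    # The batch count is the file count divided by the limit, rounded up
    assert len(batches) == math.ceil(len(files) / limit)
    print(limit, len(batches))
```

For 100 files this gives 100, 50, and 34 batches for limits of 1, 2, and 3; the last batch is smaller when the limit does not divide the file count evenly.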

Is there a way to ensure scale of records while streaming from kafka?

I'm new to Spark and Kafka, using pyspark (spark 2.4.8).
Assume we have a kafka streaming source and we want to stream at least N records to our database. What is the best way to ensure the wanted number of records and stop after reaching it?
I thought of counting the micro-batches with a global variable and limiting the number of offsets per micro-batch, but I suspect this isn't the right way to solve the problem.
My code in general:
raw_stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_server) \
    .option("subscribe", "topic1, topic2") \
    .option("startingOffsets", "earliest") \
    .option("maxOffsetsPerTrigger", offsets_number) \
    .load()
...
# define schema (not relevant)
...
counter = 0

def foreach_batch_function(df, epoch_id):
    global counter
    counter += 1

query = streaming_df \
    .writeStream \
    .outputMode("append") \
    .format("memory") \
    .queryName("query1") \
    .foreachBatch(foreach_batch_function) \
    .start()
But it didn't work. I tried to stop the query after a fixed number of micro-batches, but the counter didn't even increase.
Back to my question: what is the right way to set a lower bound on the number of requested records and then just stop?
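One possible shape for the bookkeeping is to count rows rather than micro-batches and to stop once a minimum is reached. The sketch below is plain Python with a hypothetical `RecordQuota` class. It assumes that in a real job `add_batch` would be called from the foreachBatch callback with the batch's `df.count()`, and that the driver would call `query.stop()` once `done` is true. The list of batch sizes stands in for the stream.

```python
class RecordQuota:
    """Counts rows across micro-batches and reports when a minimum is reached."""

    def __init__(self, minimum):
        self.minimum = minimum
        self.seen = 0
        self.batches = 0

    def add_batch(self, row_count):
        """Record one micro-batch; return True once the quota is met."""
        self.seen += row_count
        self.batches += 1
        return self.done

    @property
    def done(self):
        return self.seen >= self.minimum

quota = RecordQuota(minimum=250)
for batch_rows in [100, 100, 100, 100]:  # stand-in for per-batch df.count()
    if quota.add_batch(batch_rows):
        break                            # a real job would call query.stop() here
print(quota.seen, quota.batches)
```

Because whole micro-batches are counted, the total can overshoot the minimum by up to one batch; with maxOffsetsPerTrigger set, that overshoot is bounded by the per-trigger limit.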

Pyspark data aggregation with Window and sliding interval on index

I am running into an issue: I want to apply a window with a sliding interval to my CSV data and, for each window, aggregate the data to find the most common category. However, I do not have a timestamp, so I want to slide the window over the index column. Can anyone point me in the right direction on how to use windows and sliding intervals on the index?
In short, I want to create windows and intervals over the index column.
Currently I have something like this:
schema = StructType().add("index", "string").add(
    "Category", "integer")

dataframe = spark \
    .readStream \
    .option("sep", ",") \
    .schema(schema) \
    .csv("./tmp/input")

# TODO perform Window + sliding interval on dataframe, then perform aggregation per window
aggr = dataframe.groupBy("Category").count().orderBy("count", ascending=False).limit(3)

query = aggr \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()
To aggregate data on a per-window basis, you can use the window function from the pyspark.sql.functions package.
For time-based intervals, you need to add a timestamp column to your dataframe.
newDf = csvFile.withColumn("TimeStamp", current_timestamp())
This code adds the current time in the dataframe as the data is read from the csv.
trimmedDf2 = newDf.groupBy(window(col("TimeStamp"), "5 seconds")).agg(sum("value")).select("window.start", "window.end", "sum(value)")
display(trimmedDf2)
The above code sums the value column and groups the rows into 5-second timestamp windows.
Weekly Aggregation using Windows Function in Spark
You can also use the above link for reference.
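The question asks for windows over an index rather than a timestamp, so it may help to see the sliding-window arithmetic on integers. The sketch below is plain Python, not Spark code. It assumes the index is a non-negative integer (the schema in the question reads it as a string, so it would need a cast), and `window_starts` and `most_common_per_window` are hypothetical names.

```python
from collections import Counter, defaultdict

def window_starts(index, length, slide):
    """Start positions of every window [start, start + length) containing index."""
    start = (index // slide) * slide  # latest window starting at or before index
    starts = []
    while start > index - length and start >= 0:
        starts.append(start)
        start -= slide
    return sorted(starts)

def most_common_per_window(rows, length, slide):
    """Map each window start to the most frequent category inside that window."""
    counts = defaultdict(Counter)
    for index, category in rows:
        for start in window_starts(index, length, slide):
            counts[start][category] += 1
    return {start: c.most_common(1)[0][0] for start, c in sorted(counts.items())}

rows = [(0, 1), (1, 1), (2, 1), (3, 2), (4, 2), (5, 2), (6, 3)]
print(most_common_per_window(rows, length=4, slide=2))
```

With a window length of 4 and a slide of 2, each row falls into up to two overlapping windows, which is the same assignment rule a time-based sliding window applies to timestamps.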

Databricks: Structured Stream fails with TimeoutException

I want to create a structured stream in databricks with a kafka source.
I followed the instructions as described here. My script seems to start, but it fails on the first element of the stream. The stream itself works fine and produces results (in Databricks) when I use confluent_kafka, so I seem to be missing a different issue:
After the initial stream is processed, the script times out:
java.util.concurrent.TimeoutException: Stream Execution thread for stream [id = 80afdeed-9266-4db4-85fa-66ccf261aee4,
runId = b564c626-9c74-42a8-8066-f1f16c7ab53d] failed to stop within 36000 milliseconds (specified by spark.sql.streaming.stopTimeout). See the cause on what was being executed in the streaming query thread.
WHAT I TRIED: I looked on SO and found this answer, following which I added
spark.conf.set("spark.sql.streaming.stopTimeout", 36000)
to my setup, which changed nothing.
Any input is highly appreciated!
from pyspark.sql import functions as F
from pyspark.sql.types import *

# Define a data schema
schema = StructType() \
    .add('PARAMETERS_TEXTVALUES_070_VALUES', StringType())\
    .add('ID', StringType())\
    .add('PARAMETERS_TEXTVALUES_001_VALUES', StringType())\
    .add('TIMESTAMP', TimestampType())

df = spark \
    .readStream \
    .format("kafka") \
    .option("host", "stream.xxx.com") \
    .option("port", 12345)\
    .option('kafka.bootstrap.servers', 'stream.xxx.com:12345') \
    .option('subscribe', 'stream_test.json') \
    .option("startingOffset", "earliest") \
    .load()

df_word = df.select(F.col('key').cast('string'),
                    F.from_json(F.col('value').cast('string'), schema).alias("parsed_value"))

df_word \
    .writeStream \
    .format("parquet") \
    .option("path", "dbfs:/mnt/streamfolder/stream/") \
    .option("checkpointLocation", "dbfs:/mnt/streamfolder/check/") \
    .outputMode("append") \
    .start()
My stream output data looks like this:
"PARAMETERS_TEXTVALUES_070_VALUES":'something'
"ID":"47575963333908"
"PARAMETERS_TEXTVALUES_001_VALUES":12345
"TIMESTAMP": "2020-10-22T15:06:42.507+02:00"
Furthermore, the stream and check folders are filled with 0-byte files, except for the metadata, which includes the id from the error above.
Thanks and stay safe.

Exception has occurred: pyspark.sql.utils.AnalysisException 'Queries with streaming sources must be executed with writeStream.start();;\nkafka'

At the line
if not df.head(1).isEmpty:
I get this exception:
Exception has occurred: pyspark.sql.utils.AnalysisException 'Queries with streaming sources must be executed with writeStream.start();;\nkafka'
I do not know how to use an if condition on streaming data.
When I execute the code line by line in Jupyter, it works and I get my result, but running it as a .py script fails.
My purpose is this: I want to use streaming to get data from Kafka every second, convert each batch of streaming data (one batch being the data received in one second) to a pandas DataFrame, apply pandas functions to it, and finally send the result to another Kafka topic.
Please help me, and forgive my poor English. Thanks a lot.
import json

import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext("local[2]", "OdometryConsumer")
spark = SparkSession(sparkContext=sc) \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "data") \
    .load()

ds = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
print(type(ds))

# This check on a streaming DataFrame is the line that raises the AnalysisException
if not df.head(1).isEmpty:
    alertQuery = ds \
        .writeStream \
        .queryName("qalerts")\
        .format("memory")\
        .start()

    alerts = spark.sql("select * from qalerts")
    pdAlerts = alerts.toPandas()
    a = pdAlerts['value'].tolist()

    d = []
    for i in a:
        x = json.loads(i)
        d.append(x)

    df = pd.DataFrame(d)
    print(df)
    ds = df['jobID'].unique().tolist()

    dics = {}
    for source in ds:
        ids = df.loc[df['jobID'] == source, 'id'].tolist()
        dics[source] = ids

    print(dics)

query = ds \
    .writeStream \
    .queryName("tableName") \
    .format("console") \
    .start()

query.awaitTermination()
Remove if not df.head(1).isEmpty: and you should be fine.
The reason for the exception is simple, i.e. a streaming query is a structured query that never ends and is continually executed. It is simply not possible to look at a single element since there is no "single element", but (possibly) thousands of elements and it'd be hard to tell when exactly you'd like to look under the covers and see just a single element.
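For the stated goal (process each batch with pandas-style logic, then forward the result), the per-batch work can be kept in an ordinary function that never touches the streaming DataFrame directly. The sketch below is plain Python and only covers the grouping step from the question's code. `group_ids_by_job` and `process_batch` are hypothetical names, and the `jobID` and `id` fields are taken from the question. It assumes such a function would be wired in through foreachBatch (available from Spark 2.4), where each micro-batch arrives as a static DataFrame.

```python
import json
from collections import defaultdict

def group_ids_by_job(json_values):
    """Parse JSON strings and collect the 'id' values under each 'jobID'."""
    grouped = defaultdict(list)
    for raw in json_values:
        record = json.loads(raw)
        grouped[record["jobID"]].append(record["id"])
    return dict(grouped)

def process_batch(batch_values, epoch_id):
    """Hypothetical per-batch handler; returns the payload to publish downstream."""
    return {"epoch": epoch_id, "jobs": group_ids_by_job(batch_values)}

sample = [
    '{"jobID": "a", "id": 1}',
    '{"jobID": "b", "id": 2}',
    '{"jobID": "a", "id": 3}',
]
print(process_batch(sample, epoch_id=0))
```

Keeping the logic in a function that takes plain values makes it testable outside Spark; sending the returned payload to another Kafka topic is left out here.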
