PySpark structured streaming apply udf to window - apache-spark

I am trying to apply a pandas UDF to a window of a PySpark structured stream. The problem is that as soon as the stream has caught up with the current state, all new windows somehow contain only a single value.
As you can see in the screenshot, all windows after 2019-10-22T15:34:08.730+0000 contain only a single value. This is the code used to generate it:
#pandas_udf("Count long, Resampled long, Start timestamp, End timestamp", PandasUDFType.GROUPED_MAP)
def myudf(df):
df = df.dropna()
df = df.set_index("Timestamp")
df.sort_index(inplace=True)
# resample the dataframe
resampled = pd.DataFrame()
oidx = df.index
nidx = pd.date_range(oidx.min(), oidx.max(), freq="30S")
resampled["Value"] = df.Value.reindex(oidx.union(nidx)).interpolate('index').reindex(nidx)
return pd.DataFrame([[len(df.index), len(resampled.index), df.index.min(), df.index.max()]], columns=["Count", "Resampled", "Start", "End"])
predictionStream = sensorStream.withWatermark("Timestamp", "90 minutes").groupBy(col("Name"), window(col("Timestamp"), "70 minutes", "5 minutes"))

predictionStream.apply(myudf).writeStream \
    .queryName("aggregates") \
    .format("memory") \
    .start()
The stream does get new values every 5 minutes. It's just that the window somehow only takes values from the last batch, even though the watermark should not have expired.
Is there anything I am doing wrong? I already tried playing with the watermark; that had no effect on the result. I need all values of the window inside the UDF.
I am running this in Databricks on a cluster set to 5.5 LTS ML (includes Apache Spark 2.4.3, Scala 2.11).

It looks like you could specify the Output Mode you want for your writeStream.
See the documentation at Output Modes.
By default it's using Append mode:
This is the default mode, where only the new rows added to the Result Table since the last trigger will be outputted to the sink.
Try using:
predictionStream.apply(myudf).writeStream \
    .queryName("aggregates") \
    .format("memory") \
    .outputMode("complete") \
    .start()
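With the memory sink and queryName "aggregates", you can then inspect the result table from another cell; a small usage sketch:
# the memory sink registers an in-memory table under the query name
spark.sql("SELECT * FROM aggregates").show(truncate=False)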

I found a Spark JIRA issue concerning this problem, but it was closed without resolution. The bug appears to be that the Pandas UDF is executed on every trigger with only the data that arrived since the last trigger (I confirmed this independently on Spark 3.1.1), so you are likely processing only a subset of the data you want to take into account on each trigger. Grouped-map Pandas UDFs do not appear to be functional for structured streaming with a Delta table source. Please do follow up if you have since found a solution; otherwise I'll just leave this here for folks who also find this thread.
Edit: There's some discussion in the Databricks forums about first doing a streaming aggregation and following that up with a Pandas UDF (that will likely expect a single record with columns containing arrays) as shown below. I tried it. It works. However, my batch duration is high and I'm uncertain how much this additional work is contributing to it.
import pyspark.sql.functions as f

agg_exprs = [f.collect_list('col_of_interest_1'),
             f.collect_list('col_of_interest_2'),
             f.collect_list('col_of_interest_3')]
intermediate_sdf = source_sdf.groupBy('time_window', ...).agg(*agg_exprs)
final_sdf = intermediate_sdf.groupBy('time_window', ...).applyInPandas(func, schema)
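For context, a minimal sketch of what the grouped-map function and output schema could look like in this pattern; func, the schema, and the column handling below are hypothetical, not from the original discussion:
import pandas as pd

# hypothetical output schema for applyInPandas
schema = "n_points long, mean_of_interest double"

def func(pdf: pd.DataFrame) -> pd.DataFrame:
    # after the collect_list aggregation, each group arrives as a single row
    # whose columns contain arrays; unpack one of them into a pandas Series
    values = pd.Series(pdf.iloc[0]["collect_list(col_of_interest_1)"])
    return pd.DataFrame([[len(values), float(values.mean())]],
                        columns=["n_points", "mean_of_interest"])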

Related

Databricks : structure stream data assignment and display

I have the following stream code in a Databricks notebook (Python).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("MyTest") \
    .getOrCreate()

# Create a streaming DataFrame
lines = spark.readStream \
    .format("delta") \
    .table("myschema.streamTest")
In notebook 2, I have
def foreach_batch_function(df, epoch_id):
    test = df
    print(test['simplecolumn'])
    display(test['simplecolumn'])
    test['simplecolumn'].display

lines.writeStream.outputMode("append").foreachBatch(foreach_batch_function).format('console').start()
When I execute the above, where can I see the output from the .display function? I looked inside the cluster driver logs and I don't see anything. I also don't see anything in the notebook itself when it is executed, except a successfully initialized and executing stream. I do see that the dataframe parameter's data is displayed in the console, but I am trying to see that assigning test was successful.
I am trying to carry out this manipulation as a precursor to time-series operations over mini-batches for real-time model scoring in Python, but I am struggling to get the basics right in the structured streaming world. A working model exists, but it executes every 10-15 minutes; I would like to make it real-time via streams, hence this question.
You're mixing different things together - I recommend reading the initial parts of the Structured Streaming documentation, or chapter 8 of the Learning Spark, 2nd edition book (freely available from here).
You can use the display function directly on the stream (better with checkpointLocation and maybe trigger parameters, as described in the documentation):
display(lines)
Regarding the scoring - usually it's done by defining a user-defined function and applying it to the stream via the select or withColumn functions of the dataframe. The easiest way is to register a model in the MLflow registry, and then load the model with built-in functions, like:
import mlflow.pyfunc
pyfunc_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)
preds = lines.withColumn("predictions", pyfunc_udf(params...))
Look into that notebook for examples.
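For illustration, a minimal sketch of how such a scored stream might then be written out; the model URI, feature column names, and checkpoint path below are hypothetical:
import mlflow.pyfunc
from pyspark.sql.functions import struct

# hypothetical model URI and feature column names
pyfunc_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/my_model/1")
preds = lines.withColumn("predictions", pyfunc_udf(struct("feature_1", "feature_2")))

# write the scored stream somewhere, e.g. to the console for a quick check
preds.writeStream \
    .format("console") \
    .option("checkpointLocation", "/tmp/checkpoints/preds") \
    .start()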

In Pyspark Structured Streaming, how can I discard already generated output before writing to Kafka?

I am trying to do Structured Streaming (Spark 2.4.0) on Kafka source data, where I am reading the latest data and performing aggregations over a 10-minute window. I am using "update" mode while writing the data.
For example, the data schema is as below:
tx_id, cust_id, product, timestamp
My aim is to find customers who have bought more than 3 products in the last 10 minutes. Let's say prod is the dataframe read from Kafka; then windowed_df is defined as:
windowed_df_1 = prod.groupBy(window("timestamp", "10 minutes"), cust_id).count()
windowed_df = windowed_df_1.filter(col("count")>=3)
Then I am joining this with a master dataframe from the Hive table "customer_master" to get cust_name:
final_df = windowed_df.join(customer_master, "cust_id")
And finally, I write this dataframe to a Kafka sink (or the console for simplicity):
query = final_df.writeStream.outputMode("update").format("console").option("truncate",False).trigger(processingTime='2 minutes').start()
query.awaitTermination()
Now, when this code runs every 2 minutes, I want the subsequent runs to discard all those customers who were already part of my output. I don't want them in my output even if they buy any product again.
Can I write the stream output temporarily somewhere (maybe a Hive table) and do an "anti-join" for each execution?
This way I can also have a history maintained in a hive table.
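For illustration, a minimal sketch of that anti-join idea using foreachBatch; the history table name "reported_customers" is hypothetical, and this is only one way the approach could be wired up:
def dedupe_and_emit(batch_df, epoch_id):
    # customers already emitted in earlier batches (hypothetical Hive table)
    history = spark.table("reported_customers")
    new_rows = batch_df.join(history, "cust_id", "left_anti").persist()
    # emit only the not-yet-reported customers (console here for simplicity)
    new_rows.show(truncate=False)
    # remember the newly reported customers for future batches
    new_rows.select("cust_id").write.mode("append").saveAsTable("reported_customers")
    new_rows.unpersist()

query = final_df.writeStream.outputMode("update") \
    .foreachBatch(dedupe_and_emit) \
    .trigger(processingTime='2 minutes') \
    .start()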
I also read somewhere that we can write the output to a memory sink and then use df.write to save it in HDFS/Hive. But what if we terminate the job and re-run? The in-memory table will be lost in that case, I suppose.
Please help as I am new to Structured Streaming.
Update:
I also tried the below code to write the output to a Hive table as well as to the console (or a Kafka sink):
def write_to_hive(df, epoch_id):
    df.persist()
    df.write.format("hive").mode("append").saveAsTable("hive_tab_name")
    pass
final_df.writeStream.outputMode("update").format("console").option("truncate", False).start()
final_df.writeStream.outputMode("update").foreachBatch(write_to_hive).start()
But this only performs the first action, i.e. the write to the console.
If I put the "foreachBatch" write first, it saves to the Hive table but does not print to the console.
I want to write to 2 different sinks. Please help.
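Not something from this thread, but a commonly used pattern for fanning one stream out to several sinks is to perform both writes inside a single foreachBatch; a minimal sketch:
def write_to_both(df, epoch_id):
    df.persist()
    # sink 1: Hive table
    df.write.format("hive").mode("append").saveAsTable("hive_tab_name")
    # sink 2: console (a Kafka write would go here instead)
    df.show(truncate=False)
    df.unpersist()

final_df.writeStream.outputMode("update").foreachBatch(write_to_both).start()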

DropDuplicates in PySpark gives stackoverflowerror

I have a PySpark program which reads a JSON file of size around 350-400 MB and creates a dataframe out of it.
In my next step, I create a Spark SQL query using createOrReplaceTempView and select a few columns as required.
Once this is done, I filter my dataframe with some conditions. It was working fine up to this point.
Now, I needed to remove some duplicate values based on a column. So, I introduced
dropDuplicates in the next step, and it suddenly started giving me a StackOverflowError.
Below is the sample code:
def create_some_df(initial_df):
    initial_df.createOrReplaceTempView('data')
    original_df = spark.sql('select c1,c2,c3,c4 from data')
    ## Filter out some events
    original_df = original_df.filter(filter1condition)
    original_df = original_df.filter(filter2condition)
    original_df = original_df.dropDuplicates(['c1'])
    return original_df
It worked fine until I added the dropDuplicates method.
I am using a 3-node AWS EMR cluster (c5.2xlarge).
I am running PySpark using the spark-submit command in YARN client mode.
What I have tried
I tried adding persist and cache before calling filter, but it didn't help.
EDIT - Some more details
I realise that the error appears when I invoke my write function after multiple transformations, i.e. the first action.
If I have dropDuplicates in my transformations before I write, it fails with the error.
If I do not have dropDuplicates in my transformations, the write works fine.
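For reference, a minimal sketch of the call-and-write pattern described above; the filter conditions and output path are hypothetical placeholders:
from pyspark.sql.functions import col

# hypothetical filter conditions and output location
filter1condition = col('c2') > 0
filter2condition = col('c3').isNotNull()

result_df = create_some_df(initial_df)
# the StackOverflowError only surfaces here, at the first action
result_df.write.mode("overwrite").parquet("s3://my-bucket/output/")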

Dropping temporary columns in Spark

I'm creating a new column in a dataframe and using it in subsequent transformations. Later, when I try to drop the new column, it breaks the execution. When I look into the execution plan, Spark optimizes the plan by removing the whole flow because I'm dropping the column at a later stage. How can I drop a temporary column without affecting the execution plan? I'm using PySpark.
df = df.withColumn('COLUMN_1', "some transformation returns value") \
       .withColumn('COLUMN_2', "some transformation returns value")
df = df.withColumn('RESULT', when("some condition", col('COLUMN_1')).otherwise(col('COLUMN_2'))) \
       .drop('COLUMN_1', 'COLUMN_2')
I have tried this in spark-shell (using Scala) and it's working as expected.
I'm using Spark 2.4.4 and Scala 2.11.12.
I have tried the same in PySpark as well; refer to the attachment. Let me know if this answer helps you.
With PySpark (attachment not reproduced here):
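Since the attachment is not available here, below is a minimal PySpark sketch of the same kind of test, with made-up data and conditions standing in for the placeholders in the question:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("drop-temp-columns").getOrCreate()

df = spark.createDataFrame([(1, 10), (2, 20)], ["id", "value"])
df = df.withColumn('COLUMN_1', col('value') * 2).withColumn('COLUMN_2', col('value') * 3)
df = df.withColumn('RESULT', when(col('id') == 1, col('COLUMN_1')).otherwise(col('COLUMN_2'))) \
       .drop('COLUMN_1', 'COLUMN_2')
df.explain()  # inspect the optimized plan
df.show()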

Incremental Data loading and Querying in Pyspark without restarting Spark JOB

Hi all, I want to do incremental data querying.
df = spark.read.csv('csvFile', header=True)  # 1000 rows
df.persist()  # assume it takes 5 min
df.registerTempTable('data_table')  # or createOrReplaceTempView
result = spark.sql('select * from data_table where column1 > 10')  # 100 rows

df_incremental = spark.read.csv('incremental.csv')  # 200 rows
df_combined = df.unionAll(df_incremental)
df_combined.persist()  # takes more than 5 mins; I want to avoid this, because other queries might be running at this time
df_combined.registerTempTable("data_table")
result = spark.sql('select * from data_table where column1 > 10')  # 105 rows
1. Read a csv/mysql table's data into a spark dataframe.
2. Persist that dataframe in memory only (reason: I need performance and my dataset can fit in memory).
3. Register it as a temp table and run spark sql queries. # Till this point my spark job is UP and RUNNING.
4. The next day I will receive an incremental dataset (in a temp_mysql_table or a csv file). Now I want to run the same query on the total set, i.e. persisted_prevData + recent_read_IncrementalData; I will call it mixedDataset.
*** There is no certainty about when incremental data comes into the system; it can come 30 times a day.
5. Till here also I don't want the spark application to be down; it should always be up. And I need the performance of querying mixedDataset with the same time measure as if it were persisted.
My concerns:
In point 4 (P4), do I need to unpersist the prev_data and then persist the union dataframe of the previous and incremental data again?
And my most important concern is that I don't want to restart the Spark job to load/start with the updated data (only if the server goes down do I have to restart, of course).
So, at a high level, I need to query (with fast performance) dataset + incremental_data_if_any dynamically.
Currently I am doing this exercise by creating a folder for all the data; the incremental file is also placed in the same directory. Every 2-3 hrs, I restart the server and my Spark app starts by reading all the csv files present in that system, then the queries run on them.
I am also exploring Hive persistent tables and Spark Streaming, and will update here if I find any result.
Please suggest a way/architecture to achieve this.
Please comment if anything is not clear in the question, without downvoting it :)
Thanks.
Try streaming instead; it will be much faster since the session is already running, and it will be triggered every time you place something in the folder:
df_incremental = spark \
    .readStream \
    .option("sep", ",") \
    .schema(input_schema) \
    .csv(input_path)

df_incremental.where("column1 > 10") \
    .writeStream \
    .queryName("data_table") \
    .format("memory") \
    .start()

spark.sql("SELECT * FROM data_table").show()
