Can we use row_number() in PySpark Structured Streaming?

The PySpark SQL functions reference on the row_number() function says
returns a sequential number starting at 1 within a window partition
implying that the function works only on windows. Trying
df.select('*', row_number())
predictably gives a
Window function row_number() requires an OVER clause
exception.
Now, .over() seems to work only with WindowSpec because
from pyspark.sql.functions import window, row_number
...
df.select('*', row_number().over(window('time', '5 minutes')))
gives a
TypeError: window should be WindowSpec
exception.
However, according to this comment on the ASF Jira:
By time-window we described what time windows are supported in SS natively.
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#types-of-time-windows
Window spec is not supported. This defines the boundary of window as non-timed manner, the offset(s) of the row, which is hard to track in streaming context.
So WindowSpec is generally not supported in Structured Streaming, leading to the conclusion that the row_number() function is not supported in Structured Streaming either. Is that correct? I just want to make sure I'm not missing anything here.

First point: your imports are wrong. You need the Window class (which builds a WindowSpec), not the time-based window() function:
from pyspark.sql import Window
from pyspark.sql.functions import row_number
Second, try doing something like this:
partition_columns = Window.partitionBy(
    df.column1,
    df.column2,
    ...
).orderBy(df.col...)

df = df.withColumn('your_new_column_rank', row_number().over(partition_columns))
Usually we use window functions to deduplicate records. In Structured Streaming, the documentation says this is not possible, because the function cannot access already-processed data the way it can in batch, but you can set a watermark, like this:
df = df.withWatermark("timestamp", "10 minutes").withColumn('your_new_column_rank', row_number().over(partition_columns))
Alternatively, you can combine the watermark with the dropDuplicates function, as sketched below.
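A minimal sketch of that watermark-plus-dropDuplicates approach, assuming an event-time column named timestamp and placeholder key columns column1 and column2:

# Keep deduplication state only for events that arrive up to 10 minutes late.
deduped = (
    df.withWatermark("timestamp", "10 minutes")
      .dropDuplicates(["column1", "column2", "timestamp"])
)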
Another way to do it is through foreachBatch:
def func(batch_df, batch_id):
    partition_columns = Window.partitionBy(
        batch_df.column1,
        batch_df.column2,
        ...
    ).orderBy(batch_df.col...)
    batch_df = batch_df.withColumn('your_new_column_rank',
                                   row_number().over(partition_columns))
    ...

# sdf is the streaming DataFrame
writer = sdf.writeStream.foreachBatch(func)
writer.start()
As above, inside foreachBatch you get a micro-batch DataFrame that is not a Structured Streaming DataFrame, so you can use functions there that are not available on a streaming one.
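Building on the foreachBatch idea above, a hedged sketch that uses the rank column to keep only the newest record per key within each micro-batch; event_key, timestamp, and the output path are placeholders:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

def dedupe_batch(batch_df, batch_id):
    # Rank rows per key by event time, newest first, then keep only the top row.
    w = Window.partitionBy("event_key").orderBy(col("timestamp").desc())
    latest = (batch_df
              .withColumn("rn", row_number().over(w))
              .filter(col("rn") == 1)
              .drop("rn"))
    latest.write.format("delta").mode("append").save("/tmp/deduped_output")

query = sdf.writeStream.foreachBatch(dedupe_batch).start()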

Related

How to access the SparkSession from the worker?

I'm using Spark Structured Streaming in Databricks. In it, I'm using the foreach operation to perform some operations on every record of the data. But the function I'm passing to foreach uses SparkSession, and it throws an error:
_pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers.
So, is there any way to use SparkSession inside foreach?
EDIT #1:
One example of a function passed to foreach would be something like:
def process_data(row):
    df = spark.createDataFrame([row])
    df.write.mode("overwrite").saveAsTable("T2")
    spark.sql("""
        MERGE INTO T1
        USING T2 ON T2.type="OrderReplace" AND T1.ReferenceNumber=T2.originalReferenceNumber
        WHEN MATCHED THEN
          UPDATE SET
            shares = T2.shares,
            price = T2.price,
            ReferenceNumber = T2.newReferenceNumber
    """)
So, I need a SparkSession here, which is not available inside foreach.
From the description of the tasks you want to perform row by row on your streaming dataset, you can try using foreachBatch.
A streaming dataset may have thousands to millions of records and if you are hitting external systems on each record, it is a massive overhead. From your example, it seems you are updating a base data set T1 based on events in T2.
From the link above, you can do something like below,
def process_data(df, epoch_id):
    df.write.mode("overwrite").saveAsTable("T2")
    spark.sql("""
        MERGE INTO T1
        USING T2 ON T2.type="OrderReplace" AND T1.ReferenceNumber=T2.originalReferenceNumber
        WHEN MATCHED THEN
          UPDATE SET
            shares = T2.shares,
            price = T2.price,
            ReferenceNumber = T2.newReferenceNumber
    """)

streamingDF.writeStream.foreachBatch(process_data).start()
Some additional points regarding what you are doing:
You are doing a foreach on a streaming dataset and overwriting the T2 Hive table. Since records are processed in parallel by cluster managers like YARN, T2 might end up corrupted because multiple tasks may be updating it at the same time.
If you are persisting records in T2 (even temporarily), why don't you try making the merge/update logic a separate batch process? This would be similar to the foreachBatch solution I wrote above.
Updating or creating Hive tables ultimately fires a MapReduce job, which requires resource negotiation and so on, and may be painfully slow if done per record. You may want to consider a different type of destination, for example a JDBC-based one (a sketch follows below).
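A hedged sketch of that JDBC-style alternative inside foreachBatch; the connection URL, table name, and credentials below are placeholders:

def write_to_jdbc(batch_df, epoch_id):
    # Append each micro-batch to a relational table instead of firing a Hive job per record.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/mydb")
        .option("dbtable", "orders_staging")
        .option("user", "spark_user")
        .option("password", "secret")
        .mode("append")
        .save())

streamingDF.writeStream.foreachBatch(write_to_jdbc).start()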

Databricks: structured stream data assignment and display

I have the following streaming code in a Databricks notebook (Python).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("MyTest") \
    .getOrCreate()

# Create a streaming DataFrame
lines = spark.readStream \
    .format("delta") \
    .table("myschema.streamTest")
In notebook 2, I have
def foreach_batch_function(df, epoch_id):
    test = df
    print(test['simplecolumn'])
    display(test['simplecolumn'])
    test['simplecolumn'].display

lines.writeStream.outputMode("append").foreachBatch(foreach_batch_function).format('console').start()
When I execute the above, where can I see the output from the .display function? I looked inside the cluster driver logs and I don't see anything. I also don't see anything in the notebook itself when it executes, except a successfully initialized and running stream. I do see that the DataFrame parameter data is displayed in the console, but I am trying to confirm that assigning test was successful.
I am trying to carry out this manipulation as a precursor to time-series operations over mini-batches for real-time model scoring in Python, but I am struggling to get the basics right in the structured streaming world. A working model already exists but executes only every 10-15 minutes; I would like to make it real-time via streams, hence this question.
You're mixing different things together. I recommend reading the initial parts of the Structured Streaming documentation, or chapter 8 of the Learning Spark, 2nd edition book (freely available from here).
You can use the display function directly on the stream (better with the checkpointLocation and maybe trigger parameters, as described in the documentation):
display(lines)
Regarding the scoring: usually it's done by defining a user-defined function and applying it to the stream via the select or withColumn functions of the DataFrame. The easiest way is to register a model in the MLflow Model Registry and then load it with the built-in functions, like:
import mlflow.pyfunc
pyfunc_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)
preds = lines.withColumn("predictions", pyfunc_udf(params...))
Look into that notebook for examples.
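If the goal is near-real-time scoring rather than interactive display, a hedged sketch of writing the scored stream out continuously; the Delta output path, checkpoint location, and trigger interval below are placeholders:

query = (preds.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/scoring")  # placeholder path
    .trigger(processingTime="30 seconds")                      # placeholder interval
    .outputMode("append")
    .start("/tmp/scored_output"))                              # placeholder output path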

Spark dataframe select using SQL without createOrReplaceTempView

I'd like to use SQL-like syntax on a Spark DataFrame df.
Let's say I need a calculation
cal_col = 113.4*col1 +41.4*col2....
What I do at the moment is either:
1/ Broadcasting as temp view:
df.createOrReplaceTempView("df_view")
df = spark.sql("select *, 113.4*col1 +41.4*col2... AS cal_col from df_view")
Question: Is there a lot of overhead in broadcasting a big df as a view? If yes, at which point does it no longer make sense? Let's say df has 250 columns and 15 million records.
2/ PySpark DataFrame syntax, which is a bit more difficult to read and needs modification of the formula:
df = df.withColumn("cal_col", 113.4*F.col("col1") + 41.4*F.col("col2")+...)
The formula may be lengthy and become difficult to read.
Question: Is there a way to write SQL-like syntax without F.col?
Something along the lines of
df = df.select("*, (113.4*col1 +41.4*col2...) as cal_col")
You can use df.selectExpr("...") to write SQL-like expressions against your DataFrame.
df.selectExpr("*, (113.4*col1 +41.4*col2...) as cal_col")
Also, instead of creating a view, a better way to do what you want is to call df.persist() before your logic, to keep the DataFrame in memory (and spill to disk, by default), and then run your selectExpr on it.
Link: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.selectExpr
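For a concrete sketch, under the hypothetical column names col1 and col2 from the question, both selectExpr and F.expr let you keep the formula as a plain SQL string:

from pyspark.sql import functions as F

# Keep every existing column and add the computed one via a SQL expression string.
df = df.selectExpr("*", "113.4*col1 + 41.4*col2 AS cal_col")

# Equivalent using expr(), which also avoids spelling out F.col for each term.
df = df.withColumn("cal_col", F.expr("113.4*col1 + 41.4*col2"))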

PySpark structured streaming apply udf to window

I am trying to apply a pandas UDF to a window of a PySpark structured stream. The problem is that as soon as the stream has caught up with the current state, all new windows somehow contain only a single value.
As you can see in the screenshot, all windows after 2019-10-22T15:34:08.730+0000 contain only a single value. The code used to generate this is:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("Count long, Resampled long, Start timestamp, End timestamp", PandasUDFType.GROUPED_MAP)
def myudf(df):
    df = df.dropna()
    df = df.set_index("Timestamp")
    df.sort_index(inplace=True)
    # resample the dataframe
    resampled = pd.DataFrame()
    oidx = df.index
    nidx = pd.date_range(oidx.min(), oidx.max(), freq="30S")
    resampled["Value"] = df.Value.reindex(oidx.union(nidx)).interpolate('index').reindex(nidx)
    return pd.DataFrame([[len(df.index), len(resampled.index), df.index.min(), df.index.max()]],
                        columns=["Count", "Resampled", "Start", "End"])
predictionStream = sensorStream \
    .withWatermark("Timestamp", "90 minutes") \
    .groupBy(col("Name"), window(col("Timestamp"), "70 minutes", "5 minutes"))

predictionStream.apply(myudf).writeStream \
    .queryName("aggregates") \
    .format("memory") \
    .start()
The stream does get new values every 5 minutes. It's just that the window somehow only takes values from the last batch, even though the watermark should not have expired.
Is there anything I am doing wrong? I already tried playing with the watermark; that had no effect on the result. I need all values of the window inside the UDF.
I am running this in Databricks on a cluster set to 5.5 LTS ML (includes Apache Spark 2.4.3, Scala 2.11).
It looks like you could specify the Output Mode you want for your writeStream.
See documentation at Output Modes
By default it's using Append Mode:
This is the default mode, where only the new rows added to the Result Table since the last trigger will be outputted to the sink.
Try using:
predictionStream.apply(myudf).writeStream \
    .queryName("aggregates") \
    .format("memory") \
    .outputMode("complete") \
    .start()
I found a Spark JIRA issue concerning this problem, but it was closed without resolution. The bug appears to be, and I confirmed this independently on Spark version 3.1.1, that the Pandas UDF is executed on every trigger only with the data since the last trigger. So you are likely only processing a subset of the data you want to take into account on each trigger. Grouped Map Pandas UDFs do not appear to be functional for structured streaming with a Delta table source. Please do follow up if you previously found a solution; otherwise I'll just leave this here for folks who also find this thread.
Edit: There's some discussion in the Databricks forums about first doing a streaming aggregation and following that up with a Pandas UDF (that will likely expect a single record with columns containing arrays), as shown below. I tried it. It works. However, my batch duration is high and I'm uncertain how much this additional work is contributing to it.
from pyspark.sql import functions as f

agg_exprs = [f.collect_list('col_of_interest_1'),
             f.collect_list('col_of_interest_2'),
             f.collect_list('col_of_interest_3')]

intermediate_sdf = source_sdf.groupBy('time_window', ...).agg(*agg_exprs)
final_sdf = intermediate_sdf.groupBy('time_window', ...).applyInPandas(func, schema)
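For completeness, a hedged sketch of what func and schema might look like under this pattern; the output columns here are made up, and the collected-array column name assumes Spark's default collect_list naming:

import pandas as pd

# Hypothetical output schema for this sketch.
schema = "n_points long, first_value double"

def func(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group arrives as a single row whose columns hold the collected arrays.
    values = list(pdf['collect_list(col_of_interest_1)'].iloc[0])
    return pd.DataFrame({'n_points': [len(values)],
                         'first_value': [float(values[0]) if values else None]})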

How to apply window functions in HiveQL in Spark

I have seen posts discussing the usage of window functions, but I have some questions.
Since they can only be used with HiveContext, how can I switch between SQLContext and HiveContext given I am already using SQLContext?
How is it possible to run HiveQL using a window function here? I tried
df.registerTempTable("data")
from pyspark.sql import functions as F
from pyspark.sql import Window
%%hive
SELECT col1, col2, F.rank() OVER (Window.partitionBy("col1").orderBy("col3")
FROM data
and native Hive SQL
SELECT col1, col2, RANK() OVER (PARTITION BY col1 ORDER BY col3) FROM data
but neither of them works.
How can I switch between SQLContext and HiveContext given I am already using SQLContext?
You cannot. Spark data frames and tables are bound to a specific context. If you want to use HiveContext then use it all the way. You drag all the dependencies anyway.
How is it possible to run HiveQL using a window function here?
sqlContext = ... # HiveContext
sqlContext.sql(query)
The first query you use is simply invalid. The second one should work if you use the correct context and configuration; a minimal sketch follows below.
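A minimal sketch of that second approach, assuming an old Spark 1.x deployment where window functions require HiveContext; the table and column names are taken from the question:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="window-function-demo")
sqlContext = HiveContext(sc)  # window functions need HiveContext before Spark 2.0

# df is assumed to be an existing DataFrame created through this context
df.registerTempTable("data")
ranked = sqlContext.sql(
    "SELECT col1, col2, RANK() OVER (PARTITION BY col1 ORDER BY col3) AS rnk FROM data"
)
ranked.show()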
