Apply multiple functions over spark window rows - apache-spark

My understanding of spark windows is as follows:
current row (1) -> window rows (1 or more) -> aggregation func. -> output for the current row (1)
where a single row can be included in multiple windows. An aggregation function f is applied with f.over(window), which limits the window scope to a single function. For example, I cannot apply filter(), especially not a dynamic one, to only the window rows before aggregating with sum().over(window).
To do custom processing of the window rows, I can:
a) write UDF which gets window rows as input
b) use collect_list() to get window rows as a list for each row and continue processing on these lists
Is there any other option to use multiple standard spark functions on the same window rows?

The filter use case can be achieved by applying sum over a conditional expression. It is also possible to use multiple Spark functions over the same window. For example, the snippet below is valid:
(df.withColumn("a", f.sum("x").over(window))
   .withColumn("b", f.first("x").over(window))
)
If you are looking to apply custom functions, you can write a User Defined Aggregate Function (UDAF) in Scala or Java. If your only option is Python, then collect_list plus a UDF is the way to go.

Related

GroupByKey to fill values and then ungroup apache beam

I have CSV files with missing values within groups formed by primary keys (for every group, only one record has a value populated for a certain field, and I need that field populated for all records of the group). I'm processing the entire file with Apache Beam, so I want to use GroupByKey to fill the field for each group, and then ungroup it to restore the original data, now filled. The equivalent in pandas would be:
dataframe[column_to_be_filled] = dataframe.groupby(primary_key)[column_to_be_filled].ffill().bfill()
I don't know how to achieve this with Apache Beam. I first tried the Beam DataFrame API, but that would take a lot of memory.
It's better to process your elements as a PCollection instead of a DataFrame to avoid memory issues.
First read your CSV into a PCollection; then you can use GroupByKey and process the grouped elements, yielding the results from a separate transform.
It could look something like this:
(pcollection | 'Group by key' >> beam.GroupByKey()
| 'Process grouped elements' >> beam.ParDo(UngroupElements()))
The input pcollection should be list of tuples each one contains the key you want to group with and the element.
And the DoFn would look like this:
class UngroupElements(beam.DoFn):
    def process(self, element):
        k, v = element
        for elem in v:
            # process your element here, e.g. fill the missing field
            yield elem
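The per-group fill itself (finding the one populated value and copying it onto every record of the group) can be sketched in plain Python, independent of Beam; the record shape and field names here are assumptions for illustration:

```python
def fill_group(records, field):
    """Given all records of one group (as dicts), copy the single
    populated value of `field` onto every record. Assumes at least one
    record in the group has the field set, as the question states."""
    value = next(r[field] for r in records if r[field] is not None)
    return [{**r, field: value} for r in records]

group = [
    {"id": 1, "city": None},
    {"id": 2, "city": "Oslo"},
    {"id": 3, "city": None},
]
filled = fill_group(group, "city")
```

This is the logic you would place inside the DoFn's loop over the grouped values.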
You can try to use exactly the same code as Pandas in Beam: https://beam.apache.org/documentation/dsls/dataframes/overview/
You can use read_csv to read your data into a dataframe, and then apply the same code that you would use in Pandas. Not all Pandas operations are supported (https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas/), but that specific case with the group by key should work.
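For reference, the pandas expression from the question behaves like this on a toy frame (column names are assumptions; using `transform` keeps both the forward- and back-fill strictly within each group, a slight variant of the question's one-liner):

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b"],
    "col": [None, 1.0, None, None, 2.0],
})

# Within each group, forward-fill then back-fill so the single populated
# value propagates to every record of the group.
df["col"] = df.groupby("key")["col"].transform(lambda s: s.ffill().bfill())
```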

Data Flow - Window Transformation - NTILE Expression

I'm attempting to assign quartiles to a numeric source data range as it transits a data flow.
I gather that this can be accomplished by using the ntile expression within a window transform.
I've tried following the documentation provided here, without success.
This is just a basic attempt to understand the implementation before using it for real application. I have a numeric value in my source dataset, and I want the values within the range to be spread across 4 buckets and defined as such.
Thanks in advance for any assistance with this.
In the Window transformation of Data Flow, we can configure the settings, keeping the numeric source column in the "Sort" tab.
Next, in the "Window columns" tab, create a new column and write the expression nTile(4) in order to create 4 buckets.
In the Data Preview, we can see that the data is spread across the 4 buckets.
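The nTile(4) expression follows the usual SQL NTILE semantics: rows are sorted, then split into n buckets numbered 1..n whose sizes differ by at most one, with earlier buckets taking the extra rows. A small Python sketch of that logic (not the Data Flow implementation itself):

```python
def ntile(values, n):
    """Assign each value (after sorting) to one of n buckets, 1..n.
    Bucket sizes differ by at most one; earlier buckets get the extras.
    Assumes distinct values, since they are used as dict keys."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    buckets, start = {}, 0
    for b in range(1, n + 1):
        end = start + size + (1 if b <= extra else 0)
        for v in ordered[start:end]:
            buckets[v] = b
        start = end
    return buckets

quartiles = ntile([10, 20, 30, 40, 50, 60, 70, 80], 4)
```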

What is the best way to create a new Spark dataframe column based on an existing column that requires an external API call?

I have a dataframe that I am working with in a Python based Jupyter notebook. I want to add an additional column based on the content of an existing column, where the content of the new column is derived from running an external API call on the original column.
The solution I attempted was to use a Python based UDF. The first cell contains something like this:
def analysis(old_column):
    new_column = myapi.analyze(text=old_column)
    return new_column

analysis_udf = udf(analysis)
and the second cell this:
df2 = df1.withColumn("col2",analysis_udf('col1'))
df2.select('col2').show(n=5)
My dataframe is relatively large, with some 70,000 rows, where col1 can have 100 to 10,000+ characters of text. When I ran the code above in cell 2, it actually seemed to run fairly quickly (minutes) and dumped out the 5 rows of the df2 dataframe, so I thought I was in business. However, my next cell had the following code:
df2.cache()
df2.filter(col('col2').isNull()).count()
The intent of this code is to cache the contents of the new dataframe to improve access time, and then count how many entries in the dataframe have null values generated by the UDF. This surprisingly (to me) took many hours to run, and eventually returned 6. It's not clear to me why the second cell ran quickly while the third was slow. I would have thought that the df2.select('col2').show(n=5) call would cause the UDF to run on all of the rows, making that call the slow one, and that subsequent accesses to the new column would be quick. But that wasn't the case, so I then supposed that the cache call was what actually caused the UDF to run on all of the rows, and that any subsequent calls should now be quick. So I added another cell with:
df2.show(n=5)
I assumed it would run quickly, but again it took much longer than I expected, and it seems like the UDF was running again. (?)
My questions are
Which Spark API calls actually cause the UDF to run (or re-run), and how do I structure the calls so that the UDF runs only once, creating the new column with the text output by the UDF's Python function?
I have read that Python UDFs should be avoided because they are slow (seems correct) so what alternatives do I have when I need to use an API call to generate the new column?
I would have thought that the df2.select('col2').show(n=5) call would have caused the UDF to run on
That is not a correct assumption. Spark will evaluate as little data as possible, given the limitations of the API. Because you use a Python udf, it will evaluate only the minimum number of partitions required to collect 5 rows.
Which spark api calls actually cause the udf to run (or re-run), and how to structure the calls to run the UDF only once so that the new column is created with the text output by the UDF's python function.
Any evaluation, if the data is no longer cached (evicted from memory).
Possibly any usage of the resulting column, unless the udf is marked as non-deterministic.
I have read that Python UDFs should be avoided because they are slow (seems correct) so what alternatives do I have when I need to use an API call to generate the new column?
Unless you want to switch to Scala or the RDD API, the only alternative is pandas_udf, which is somewhat more efficient but supports only a limited subset of types.

Spark window function without orderBy

I have a DataFrame with columns a, b, for which I want to partition the data by a using a window function and then give unique indices to b:
val window_filter = Window.partitionBy($"a").orderBy($"b".desc)
val df2 = df.withColumn("uid", row_number().over(window_filter))
But for this use-case, ordering by b is unneeded and may be time consuming. How can I achieve this without ordering?
row_number() without an order by, or with an order by on a constant, has non-deterministic behavior and may produce different results for the same rows from run to run due to parallel processing. The same may happen even if the order-by column is unchanged: the order of rows may differ from run to run and you will get different results.

Cassandra aggregate function "first_row"

I know that Cassandra allows GROUP BY and can run UDFs on that data.
Is there any default function to get the first row of each aggregated set?
(How) Can I stop processing data and return a result from my UDF immediately (e.g. after one or a few rows processed)?
Now I'm using ... COUNT(1) ... as workaround.
Actually, you don't need any UDF. It works as described out of the box.
Just GROUP BY the fields you need.
