Understanding pandas_udf - apache-spark

The pandas_udf page in the PySpark documentation has the following paragraph:
The user-defined functions do not support conditional expressions or short circuiting in boolean expressions and it ends up with being executed all internally. If the functions can fail on special rows, the workaround is to incorporate the condition into the functions.
Can somebody explain to me what this means? It seems to be saying that the UDF does not support conditional statements (if/else blocks), and then suggests that the workaround is to include the if/else condition in the function body. This does not make sense to me. Please help.

I read something similar in Learning Spark - Lightning-Fast Data Analytics
In Chapter 5 - User Defined Functions it talks about evaluation order and null checking in Spark SQL.
If your UDF can fail when dealing with NULL values, it's best to move that logic inside the UDF itself, just as the quote you provided says.
Here's the reasoning behind it:
Spark SQL (this includes the DataFrame API and the Dataset API) does not guarantee the order of evaluation of subexpressions. For example, the following query does not guarantee that the s IS NOT NULL clause is executed prior to strlen(s):
spark.sql("SELECT s FROM test1 WHERE s IS NOT NULL AND strlen(s) > 1")
Therefore, to perform proper null checking, it is recommended that you make the UDF itself null-aware and do the null checking inside the UDF.
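To make this concrete, here is a minimal sketch of a null-aware UDF (the strlen_safe name and the tiny example data are made up for illustration); because the null check lives inside the function, the evaluation order of the WHERE predicates no longer matters:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

@udf(returnType=IntegerType())
def strlen_safe(s):
    # Handle NULL inside the UDF instead of relying on a separate IS NOT NULL predicate
    if s is None:
        return None
    return len(s)

df = spark.createDataFrame([("spark",), (None,), ("a",)], ["s"])
# The NULL row yields NULL for the comparison, which WHERE treats as false
df.where(strlen_safe(col("s")) > 1).show()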

Related

Filtering on NULL values in a spark dataframe does not work on all columns?

While writing this question, I managed to find an explanation. But since it seems a tricky point, I will post it and answer it anyway. Feel free to add to it.
I am seeing what appears to me to be inconsistent behaviour of pyspark, but as I am quite new to it I may be missing something... All my steps are run in an Azure Databricks notebook, and the data comes from a parquet file hosted in Azure Data Lake Storage Gen2.
I simply want to filter the NULL records from a spark dataframe, created by reading a parquet file, with the following steps.
Filtering on the phone column just works fine.
We can see that at least some contact_tech_id values are also missing. But when filtering on this specific column, an empty dataframe is returned...
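The filters were of roughly this form (a sketch, with df and the column names as in the question):

from pyspark.sql.functions import col

# Filtering on the phone column returns the expected rows
df.filter(col("phone").isNull()).show()

# The same filter on contact_tech_id unexpectedly comes back empty
df.filter(col("contact_tech_id").isNull()).show()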
Is there any explaination on why this could happen, or what I should look for?
Apache Spark supports the standard comparison operators such as >, >=, =, < and <=. The result of these operators is unknown (NULL) when one or both of the operands are NULL. To compare NULL values for equality, Spark provides a null-safe equal operator (<=>), which returns False when only one of the operands is NULL and True when both operands are NULL. Instead of relying on is null alone, I would recommend the <=> operator.
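In PySpark the same operator is exposed as Column.eqNullSafe; a small sketch with made-up data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("0612345678",), (None,)], ["phone"])

# Standard equality: NULL = NULL evaluates to NULL, so the NULL row is filtered out
df.filter(col("phone") == lit(None)).show()

# Null-safe equality: NULL <=> NULL evaluates to True, so the NULL row is kept
df.filter(col("phone").eqNullSafe(lit(None))).show()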
The reason why filtering on contact_tech_id NULL values was unsuccessful is that what appears as null in this column in the notebook output is in fact a NaN value ("Not a Number", see here for more information).
This became apparent when converting to a pandas dataframe, whose output is more readable.
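The distinction can also be checked directly in PySpark (a sketch, assuming contact_tech_id is a float or double column, since NaN only exists for floating-point types):

from pyspark.sql.functions import col, isnan

# isNull() matches true NULL values only; NaN is an ordinary floating-point value
df.filter(col("contact_tech_id").isNull()).count()

# isnan() matches the "Not a Number" values that the notebook rendered like nulls
df.filter(isnan(col("contact_tech_id"))).count()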

Update a pyspark Delta Table using a python boolean function

So I have a Delta table that I want to update based on a condition combining two column values;
i.e.
delta_table.update(
    condition=is_eligible(col("name"), col("age")),
    set={"pension_eligible": lit("yes")}
)
I'm aware that I can do something similar to:
delta_table.update(
    condition=(col("name") == "Einar") & (col("age") > 65),
    set={"pension_eligible": lit("yes")}
)
But since my logic for computing this is quite complex (I need to look up the name in a database), I would like to define my own Python function for it (is_eligible(...)). Another reason is that this function is used elsewhere and I would like to minimize code duplication.
Is this possible at all? As I understand it, you could define it as a UDF, but they only take one parameter and I need at least two. I cannot find anything about more complex conditions in the Delta Lake documentation, so I'd really appreciate some guidance here.
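For what it's worth, PySpark UDFs can take several columns as arguments, so one possible sketch looks like the following (the eligibility logic here is only a placeholder, and whether a per-row database lookup inside a UDF is acceptable depends on your setup):

from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import BooleanType

@udf(returnType=BooleanType())
def is_eligible(name, age):
    # Placeholder logic; the real implementation might consult an external service
    return name == "Einar" and age is not None and age > 65

delta_table.update(
    condition=is_eligible(col("name"), col("age")),
    set={"pension_eligible": lit("yes")}
)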

PySpark: combine aggregate and window functions

I am working with legacy Spark SQL code like this:
SELECT
    column1,
    max(column2),
    first_value(column3),
    last_value(column4)
FROM
    tableA
GROUP BY
    column1
ORDER BY
    columnN
I am rewriting it in PySpark as below
df.groupBy("column1").agg(max("column2"), first("column3"), last("column4")).orderBy("columnN")
When I'm comparing the two outcomes I can see differences in the fields generated by the first_value/first and last_value/last functions.
Are they behaving in a non-deterministic way when used outside of Window functions?
Can groupBy aggregates be combined with Window functions?
This behaviour is possible when you have a wide table and don't specify an ordering for the remaining columns. What happens under the hood is that Spark takes the first() or last() row that happens to be available to it, i.e. whichever condition-matching row it encounters first. Spark SQL and pyspark may therefore access different rows, because the ordering of the remaining columns is not specified.
As for Window functions, you can use partitionBy(f.col('column_name')) in your Window, which works somewhat like a groupBy - it groups the data according to a partitioning column. However, without specifying the ordering for all columns, you may run into the same problem of non-determinism. Hope this helps!
For completeness' sake, I recommend having a look at the pyspark docs for the first() and last() functions here: https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.functions.first
In particular, the following note sheds light on why your behaviour was non-deterministic:
Note The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle.
Definitely!
from pyspark.sql.window import Window
import pyspark.sql.functions as F

partition = Window.partitionBy("column1").orderBy("columnN")
data = data.withColumn("max_col2", F.max(F.col("column2")).over(partition))\
    .withColumn("first_col3", F.first(F.col("column3")).over(partition))\
    .withColumn("last_col4", F.last(F.col("column4")).over(partition))
data.show(10, False)
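One caveat worth noting: because the window spec includes an orderBy, the default frame runs from unboundedPreceding to the current row, so F.last(...).over(partition) returns the current row's value rather than the true last value in each partition. If the per-partition last value is what you're after, the frame would typically be widened explicitly, for example:

partition_full = Window.partitionBy("column1").orderBy("columnN")\
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
data = data.withColumn("last_col4", F.last(F.col("column4")).over(partition_full))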

How to use values (as Column) in function (from functions object) where Scala non-SQL types are expected?

I'd like to understand how I can dynamically add a number of days to a given timestamp. I tried something similar to the example shown below. The issue here is that the second argument is expected to be of type Int, whereas in my case it is of type Column. How do I unbox this / get the actual value? (The code examples below might not be 100% correct as I'm writing this from the top of my head ... I don't have the actual code with me currently.)
myDataset.withColumn("finalDate",date_add(col("date"),col("no_of_days")))
I tried casting:
myDataset.withColumn("finalDate",date_add(col("date"),col("no_of_days").cast(IntegerType)))
But this did not help either. So how is it possible to solve this?
I did find a workaround by using selectExpr:
myDataset.selectExpr("date_add(date,no_of_days) as finalDate")
While this works, I still would like to understand how to get the same result with withColumn.
withColumn("finalDate", expr("date_add(date,no_of_days)"))
The above syntax should work.
I think it's not possible as you'd have to use two separate similar-looking type systems - Scala's and Spark SQL's.
What you call a workaround using selectExpr is probably the only way to do it, since you stay within a single type system - Spark SQL's - and the parameters are all defined in Spark SQL's "realm".
myDataset.selectExpr("date_add(date,no_of_days) as finalDate")
BTW, you've just shown me another way in which support for SQL differs from the Dataset's query DSL: the source of the parameters to functions -- only from structured data sources, only from Scala, or a mixture thereof (as in UDFs and UDAFs). Thanks!

Dataset predicate pushdown after .as(Encoders.kryo)

Please help me write an optimal Spark query. I have read about predicate pushdown:
When you execute where or filter operators right after loading a dataset, Spark SQL will try to push the where/filter predicate down to the data source using a corresponding SQL query with WHERE clause (or whatever the proper language for the data source is).
Will predicate pushdown work after the .as(Encoders.kryo(MyObject.class)) operation?
spark
.read()
.parquet(params.getMyObjectsPath())
// As I understand it, predicate pushdown will work here
// But I should construct MyObject from org.apache.spark.sql.Row manually
.as(Encoders.kryo(MyObject.class))
// QUESTION: will predicate pushdown work here as well?
.collectAsList();
It won't work. Once you use Encoders.kryo you get just a binary blob, which doesn't really benefit from columnar storage and doesn't provide efficient access to individual fields (i.e. without deserializing the whole object), not to mention predicate pushdown or more advanced optimizations.
You may be better off with Encoders.bean if the MyObject class allows for that. In general, to take full advantage of Dataset optimizations you'll need at least a type which can be encoded using a more specific encoder.
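As a side note, one way to see whether a filter is actually pushed down is to inspect the physical plan; a PySpark sketch (the path and column name are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/my_objects")  # hypothetical path
# If pushdown applies, the FileScan parquet node of the printed plan lists the predicate under PushedFilters
df.filter(col("status") == "ACTIVE").explain(True)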
Related: Spark 2.0 Dataset vs DataFrame
