Update a pyspark Delta Table using a python boolean function - apache-spark

so I have a delta table that I want to update based on a condition of two column values combined;
i.e.
delta_table.update(
condition=is_eligible(col("name"), col("age"))
set={"pension_eligible": lit("yes")}
)
I'm aware that I can do something similar to:
delta_table.update(
condition=(col("name") == "Einar") & (col("age") > 65)
set={"pension_eligible": lit("yes")}
)
But since my logic for computing this is quite complex (I need to look up the name in a database) I would like to define my own Python function for computing this (is_eligible(...)). Other reasons are because this function is used elsewhere and I would like to minimize code duplication.
Is this possible at all? As I understand you could define it as an UDF, but they only take one parameter and I need at least two. I can not find anything about more complex conditions in the delta lake documentation, so I'd really appreciate some guidance here.

Related

PySpark: combine aggregate and window functions

I am working with a legacy Spark SQL code like this:
SELECT
column1,
max(column2),
first_value(column3),
last_value(column4)
FROM
tableA
GROUP BY
column1
ORDER BY
columnN
I am rewriting it in PySpark as below
df.groupBy(column1).agg(max(column2), first(column3), last(column4)).orderBy(columnN)
When I'm comparing the two outcomes I can see differences in the fields generated by the first_value/first and last_value/last functions.
Are they behaving in a non-deterministic way when used outside of Window functions?
Can groupBy aggregates be combined with Window functions?
This behaviour is possible when you have a wide table and you don't specify ordering for the remaining columns. What happens under the hood is that spark takes first() or last() row, whichever is available to it as the first condition-matching row on the heap. Spark SQL and pyspark might access different elements because the ordering is not specified for the remaining columns.
In terms of Window function, you can use a partitionBy(f.col('column_name')) in your Window, which kind of works like a groupBy - it groups the data according to a partitioning column. However, without specifying the ordering for all columns, you might arrive at the same problem of non-determinicity. Hope this helps!
For completeness sake, I recommend you have a look at the pyspark doc for the first() and last() functions here: https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.functions.first
In particular, the following note brings light to why you behaviour was non-deterministic:
Note The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle.
Definitely !
import pyspark.sql.functions as F
partition = Window.partitionBy("column1").orderBy("columnN")
data = data.withColumn("max_col2", F.max(F.col("column2")).over(partition))\
.withColumn("first_col3", F.first(F.col("column3")).over(partition))\
.withColumn("last_col4", F.last(F.col("column4")).over(partition))
data.show(10, False)

Dataframe withColumn function

I know general structure of how to use withColumn function with a DataFrame like
df = df.withColumn("new_column_name", ((df.someColumn > someValue) & (df.someColumn < someOtherValue)))
Lets say, now, that the operator information (> and < in the above example) is stored as string(inputted by user). How can I perform above kind of operations? One naive way I can think of is to write many if else blocks one for each kind of operation and whenever we want to add new operation then we would have to add more if else blocks.
What obvious tweaks am I missing here?
Thanks in advance.

Raw sql with many columns

I'm building a CRUD application that pulls data using Persistent and executes a number of fairly complicated queries, for instance using window functions. Since these aren't supported by either Persistent or Esqueleto, I need to use raw sql.
A good example is that I want to select rows in which the value does not deviate strongly from the previous value, so in pseudo-sql the condition is WHERE val - lag(val) <= x. I need to run this selection in SQL, rather than pulling all data and then filtering in Haskell, because otherwise I'd have way to much data to handle.
These queries return many columns. However, the RawSql instance maxes out at tuples with 8 elements. So now I am writing additional functions from9, to9, from10, to10 and so on. And after that, all these are converted using functions with type (Single a, Single b, ...) -> DesiredType. Even though this could be shortened using code generation, the approach is simply hacky and clearly doesn't feel like good Haskell. This concerns me because I think most of my queries will require rawSql.
Do you have suggestions on how to improve this? Currently, my main thought is to un-normalize the database and duplicate data, e.g. by including the lagged value as column, so that I can query the data with Esqueleto.

Spark Dataframe / SQL - Complex enriching nested data

Context
I have an example of event source data in a dataframe input as shown below.
SOURCE
where eventOccurredTime is a String type. This is from the source and I want to retain this in its original string form (with nano sec)
And I want to use the string to enrich some extra date/time typed data for downstream usage. below is an example
TARGET
Now as a one off I can execute some spark sql on the dataframe as shown below to get the result I want:
import org.apache.spark.sql.DataFrame
def transformDF(): DataFrame = {
spark.sql(
s"""
SELECT
id,
struct(
event.eventCategory,
event.eventName,
event.eventOccurredTime,
struct (
CAST(date_format(event.eventOccurredTime,"yyyy-MM-dd'T'HH:mm:ss.SSS") AS TIMESTAMP) AS eventOccurredTimestampUTC,
CAST(date_format(event.eventOccurredTime,"yyyy-MM-dd'T'HH:mm:ss.SSS") AS DATE) AS eventOccurredDateUTC,
unix_timestamp(substring(event.eventOccurredTime,1,23),"yyyy-MM-dd'T'HH:mm:ss.SSS") * 1000 AS eventOccurredTimestampMillis,
datesDim.dateSeq AS eventOccurredDateDimSeq
) AS eventOccurredTimeDim,
NOTE: This is a snippet, for the full event, I have to do this explicitly in this long SQL 20 times for the 20 string dates
Some things to point out:
unix_timestamp(substring(event.eventOccurredTime,1,23)
Above I found I had to substring a date that had nano precision or would return null, hence the substring
xDim.xTimestampUTC
xDim.xDateUTC
xDim.xTimestampMillis
xDim.xDateDimSeq
above is the pattern / naming convention for the 4 nested xDim struct fields to derive and they are present in the predefined spark schema the json is read using to create the source dataframe.
datesDim.dateSeq AS eventOccurredDateDimSeq
To get the above 'eventOccurredDateDimSeq' field, I need to join to a dates dimensions table 'datesDim' (static with an hourly grain), where dateSeq is the 'key' where this date falls into an hourly bucket where datesDim.UTC is defined to the hour
LEFT OUTER JOIN datesDim ON
CAST(date_format(event.eventOccurredTime,"yyyy-MM-dd'T'HH:00:00") AS TIMESTAMP) = datesDim.UTC
The table is globally available in the spark cluster so should be quick to look up, but I need to do this for every date enrichment in the payloads and they will have different dates.
dateDimensionDF.write.mode("overwrite").saveAsTable("datesDim")
The general schema pattern is that if there is a string date whose field name is:
x
..there is a 'xDim' struct equiv that immediately follows it in schema order below as described.
xDim.xTimestampUTC
xDim.xDateUTC
xDim.xTimestampMillis
xDim.xDateDimSeq
As mentioned with the snippet, although in the image above I am only showing 'eventOccuredTime' in above, there are more of these through the schema, at lower levels too, that need the same transformation pattern applied.
Problem:
So I have the spark sql (the full monty the snippet came from) to do this one off for 1 event type and its a large, explicit SQL statement that applies the time functions and joins I showed), but here is my problem I need help with.
So I want to try and create a more generic, functionally orientated reusable solution, that traverses a nested dataframe and applies this transformation pattern as described above 'where it needs to'
How do define 'where it needs to'?
Perhaps the naming convention is a good start - traverse the DF, look for any struct fields that have the xDim ('Dim' suffix) pattern, and use the 'x' field presceding as the input, and populate the xDim.* values in line with the naming pattern as described?
How in a function to best join on the datesDim registered table (its static remember) so it performs?
Solution?
Think one or more UDF is needed (we use Scala), maybe by itself or as a fragment within SQL, but not sure. Ensuring the DatesDim lookup performs is key I think.
Or maybe there is another way?
Note: I am working with Dataframes / SparkSQL not Datasets, options for each welcomed though?
Databricks
NOTE: Im actually using the databricks platform for this, so for those verse in SQL 'Higher order functions' in Dbricks
https://docs.databricks.com/spark/latest/spark-sql/higher-order-functions-lambda-functions.html
....is there a slick option here using 'TRANSFORM' as a SQL HOF (might need to register a utility UDF and use this with transform perhaps)?
Awesome, thanks spark community for your help!!! Sorry this is a long post setting the scene.

mongodb mapreduce aggregation solution or merge into alternative

Below is the logic that I am trying to implement but I am finding it really difficult to figure out a way with MongoDB/ Node.js app
Data: country, state, val1
I need to compute mean and std. deviation using the below formula. I checked other stack overflow posts but the std dev formula that i am working is not the same:
for each row -> group by country, state
mean = sum(val1)/count ->
for each row ->
deviation += Math.pow((val1 - mean), 2)
for each row -> group by country, state
std dev = Math.sqrt(dev/ count)
the problem is with the way deviation needs to be computed. It looks like I need an aggregation for Mean before computing the deviation/ std dev through Map reduce which I dont find a way to compute. Could anyone suggest a way to do this?
If it is not possible, do we have a way to issue an update statement in mongodb similar to the below traditional merge query? I shall update the mean value for all the rows and would later invoke Mapreduce for the deviation/std dev.
merge into Tbl1 a using
(select b.country, b.state, sum(b.val1)/count(b.val1) as mean
from Tbl1 b
group by b.country, b.state) c
on (a.country = c.country and
a.state = c.state)
when matched
then update
set a.mean = c.mean
I am pretty new to the nosql and nodejs and it would be great if you guys could suggest a solution/ alternative.
Yes, computing standard deviation using map-reduce is tricky as you need to compare each data value to the mean in the traditional algorithm.
Take a look at this solution based upon the parallel calculation algorithm: https://gist.github.com/RedBeard0531/1886960

Resources