Can you tell Spark to calculate `when` function's arguments lazily? [duplicate] - apache-spark

I want to create a new boolean column in my dataframe that derives its value from the evaluation of two conditional statements on other columns in the same dataframe:
columns = ["id", "color_one", "color_two"]
data = spark.createDataFrame([(1, "blue", "red"), (2, "red", None)]).toDF(*columns)
data = data.withColumn('is_red', data.color_one.contains("red") | data.color_two.contains("red"))
This works fine unless either color_one or color_two is NULL in a row. In cases like these, is_red is also set to NULL for that row instead of true or false:
+---+---------+---------+------+
| id|color_one|color_two|is_red|
+---+---------+---------+------+
|  1|     blue|      red|  true|
|  2|      red|     NULL|  NULL|
+---+---------+---------+------+
This means that PySpark is evaluating all of the clauses of the conditional statement rather than exiting early (via short-circuit evaluation) if the first condition happens to be true (like in row 2 of my example above).
Does PySpark support the short-circuit evaluation of conditional statements?
In the meantime, here is a workaround I have come up with to null-check each column:
from pyspark.sql import functions as F
color_one_is_null = data.color_one.isNull()
color_two_is_null = data.color_two.isNull()
data = data.withColumn(
    'is_red',
    F.when(color_two_is_null, data.color_one.contains("red"))
     .otherwise(F.when(color_one_is_null, data.color_two.contains("red"))
     .otherwise(F.when(color_one_is_null & color_two_is_null, F.lit(False))
     .otherwise(data.color_one.contains("red") | data.color_two.contains("red"))))
)

I don't think Spark supports short-circuit evaluation of conditionals, as stated here https://docs.databricks.com/spark/latest/spark-sql/udf-python.html#:~:text=Spark%20SQL%20(including,short-circuiting%E2%80%9D%20semantics.:
Spark SQL (including SQL and the DataFrame and Dataset API) does not guarantee the order of evaluation of subexpressions. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order. For example, logical AND and OR expressions do not have left-to-right “short-circuiting” semantics.
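Given that, one workaround that does not depend on evaluation order is to coalesce each comparison to False before OR-ing them; a minimal sketch:
from pyspark.sql import functions as F

# Treat a NULL comparison result as False so the OR can never yield NULL.
data = data.withColumn(
    'is_red',
    F.coalesce(data.color_one.contains("red"), F.lit(False))
    | F.coalesce(data.color_two.contains("red"), F.lit(False))
)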
Another alternative would be creating an array of color_one and color_two, then evaluating whether the array contains 'red' using the SQL higher-order function EXISTS:
data = data.withColumn('is_red', F.expr("EXISTS(array(color_one, color_two), x -> x = 'red')"))
data.show()
+---+---------+---------+------+
| id|color_one|color_two|is_red|
+---+---------+---------+------+
|  1|     blue|      red|  true|
|  2|      red|     null|  true|
|  3|     null|    green| false|
|  4|   yellow|     null| false|
|  5|     null|      red|  true|
|  6|     null|     null| false|
+---+---------+---------+------+
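If you prefer the Python column API over an expr string, the same higher-order function is exposed as F.exists; a rough equivalent sketch (assumes Spark 3.1+):
from pyspark.sql import functions as F

# Same idea with the functions API instead of an expr string.
data = data.withColumn(
    'is_red',
    F.exists(F.array("color_one", "color_two"), lambda x: x == "red")
)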

Related

PySpark: Time since previous True

I have a Spark dataframe, like so:
# For sake of simplicity only one id is shown, but there are multiple objects
+---+-------------------+------+
| id| timstm|signal|
+---+-------------------+------+
| X1|2022-07-01 00:00:00| null|
| X1|2022-07-02 00:00:00| true|
| X1|2022-07-03 00:00:00| null|
| X1|2022-07-05 00:00:00| null|
| X1|2022-07-09 00:00:00| true|
+---+-------------------+------+
And I want to create a new column that contains the time since the signal column was last true
+---+-------------------+------+---------+
| id| timstm|signal|time_diff|
+---+-------------------+------+---------+
| X1|2022-07-01 00:00:00| null| null|
| X1|2022-07-02 00:00:00| true| 0.0|
| X1|2022-07-03 00:00:00| null| 1.0|
| X1|2022-07-05 00:00:00| null| 3.0|
| X1|2022-07-09 00:00:00| true| 0.0|
+---+-------------------+------+---------+
Any ideas how to approach this? My intuition is to somehow use window and filter to achieve this, but I'm not sure
This logic is a bit hard to express in native PySpark. It might be easier to express as a pandas_udf. I will use the Fugue library to bring the Python/Pandas code to a Pandas UDF; if you don't want to use Fugue, you can still write the Pandas UDF directly, it just takes a lot more code.
Setup
Here I am just creating the DataFrame in the example. I know this is a Pandas DataFrame, we will convert it to Spark and run the solution on Spark later.
I suggest filling the null with False in the original DataFrame. This is because the Pandas code uses a group-by and NULL values are dropped by default in Pandas groupby. Filling the NULL with False will make it work properly (and I think it's also easier for conversion between Spark and Pandas).
import pandas as pd

df = pd.DataFrame({"id": ["X1"] * 5,
                   "timestm": ["2022-07-01", "2022-07-02", "2022-07-03", "2022-07-05", "2022-07-09"],
                   "signal": [None, True, None, None, True]})
df['timestm'] = pd.to_datetime(df['timestm'])
df['signal'] = df['signal'].fillna(False)
Solution 1
When we use a Pandas UDF, the important piece is that the function is applied per Spark partition, so the function just needs to be able to handle a single id. We then partition the Spark DataFrame by id and run the function for each group later.
Also be aware that ordering may not be guaranteed, so we sort the data by time as the first step. The Pandas code here is really just taken from another post and modified.
def process(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values('timestm')
    df['days_since_last_event'] = df['timestm'].diff().apply(lambda x: x.days)
    df.loc[:, 'days_since_last_event'] = df.groupby(df['signal'].shift().cumsum())['days_since_last_event'].cumsum()
    df.loc[df['signal'] == True, 'days_since_last_event'] = 0
    return df
process(df)
This will give us:
id timestm signal days_since_last_event
X1 2022-07-01 False NaN
X1 2022-07-02 True 0.0
X1 2022-07-03 False 1.0
X1 2022-07-05 False 3.0
X1 2022-07-09 True 0.0
Which looks right. Now we can bring it to Spark using Fugue with minimal additional lines of code. This will partition the data and run the function on each partition. Schema is a requirement for Pandas UDF so Fugue needs it also, but uses a simpler way to define it.
import fugue.api as fa
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)
out = fa.transform(sdf, process, schema="*, days_since_last_event:int", partition={"by": "id"})
# out is a Spark DataFrame because a Spark DataFrame was passed in
out.show()
which gives us:
+---+-------------------+------+---------------------+
| id|            timestm|signal|days_since_last_event|
+---+-------------------+------+---------------------+
| X1|2022-07-01 00:00:00| false|                 null|
| X1|2022-07-02 00:00:00|  true|                    0|
| X1|2022-07-03 00:00:00| false|                    1|
| X1|2022-07-05 00:00:00| false|                    3|
| X1|2022-07-09 00:00:00|  true|                    0|
+---+-------------------+------+---------------------+
Note: be sure to define the partitioning (by id) when running on the full data.
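For reference, a rough equivalent without Fugue would be PySpark's applyInPandas (Spark 3.0+), reusing the process function above and spelling the output schema out by hand:
# Sketch: run the same pandas function once per id group, without Fugue.
out = sdf.groupBy("id").applyInPandas(
    process,
    schema="id string, timestm timestamp, signal boolean, days_since_last_event double"
)
out.show()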

Apache Spark: Get the first and last row of each partition

I would like to get the first and last row of each partition in spark (I'm using pyspark). How do I go about this?
In my code I repartition my dataset based on a key column using:
mydf.repartition(keyColumn).sortWithinPartitions(sortKey)
Is there a way to get the first row and last row for each partition?
Thanks
I would highly advise against working with partitions directly. Spark does a lot of DAG optimisation, so when you try executing specific functionality on each partition, all your assumptions about the partitions and their distribution might be completely false.
However, you seem to have a keyColumn and a sortKey, so I'd suggest the following:
import pyspark
import pyspark.sql.functions as f

w_asc = pyspark.sql.Window.partitionBy(keyColumn).orderBy(f.asc(sortKey))
w_desc = pyspark.sql.Window.partitionBy(keyColumn).orderBy(f.desc(sortKey))

res_df = mydf \
    .withColumn("rn_asc", f.row_number().over(w_asc)) \
    .withColumn("rn_desc", f.row_number().over(w_desc)) \
    .where("rn_asc = 1 or rn_desc = 1")
The resulting dataframe will have 2 additional columns, where rn_asc=1 indicates the first row and rn_desc=1 indicates the last row.
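If you then want the first and last rows as separate dataframes, a small follow-up sketch (names taken from the code above):
# Split the result and drop the helper columns.
first_rows = res_df.where("rn_asc = 1").drop("rn_asc", "rn_desc")
last_rows = res_df.where("rn_desc = 1").drop("rn_asc", "rn_desc")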
Scala: I think repartition is not by a key column alone; it expects the integer number of partitions you want. Here is a way to select the first and last row per key using Spark's Window functions instead.
First, this is my test data.
+---+-----+
| id|value|
+---+-----+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 2|
| 2| 3|
| 3| 1|
| 3| 3|
| 3| 5|
+---+-----+
Then I use the Window function twice, because identifying the last row directly is not easy, but doing it in reverse order is.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank, when}

val a = Window.partitionBy("id").orderBy("value")
val d = Window.partitionBy("id").orderBy(col("value").desc)
val df = spark.read.option("header", "true").csv("test.csv")

df.withColumn("marker", when(rank.over(a) === 1, "Y").otherwise("N"))
  .withColumn("marker", when(rank.over(d) === 1, "Y").otherwise(col("marker")))
  .filter(col("marker") === "Y")
  .drop("marker").show
The final result is then,
+---+-----+
| id|value|
+---+-----+
| 3| 5|
| 3| 1|
| 1| 4|
| 1| 1|
| 2| 3|
| 2| 1|
+---+-----+
Here is another approach using mapPartitions from the RDD API. We iterate over the elements of each partition until we reach the end. I would expect this iteration to be very fast since we only keep the two edge elements of each partition and ignore everything in between. Here is the code:
df = spark.createDataFrame([
    ["Tom", "a"],
    ["Dick", "b"],
    ["Harry", "c"],
    ["Elvis", "d"],
    ["Elton", "e"],
    ["Sandra", "f"]
], ["name", "toy"])

def get_first_last(it):
    first = last = next(it)
    for last in it:
        pass
    # Attention: if first equals last by reference return only one!
    if first is last:
        return [first]
    return [first, last]

# coalesce here is just for demonstration
first_last_rdd = df.coalesce(2).rdd.mapPartitions(get_first_last)
spark.createDataFrame(first_last_rdd, ["name", "toy"]).show()

# +------+---+
# |  name|toy|
# +------+---+
# |   Tom|  a|
# | Harry|  c|
# | Elvis|  d|
# |Sandra|  f|
# +------+---+
PS: Odd positions will contain the first element of a partition and even positions the last one. Also note that the number of results will be (numPartitions * 2) - numPartitionsWithOneItem, which I expect to be relatively small, so you shouldn't worry about the cost of the extra createDataFrame call.

How to do a conditional aggregation after a groupby in pyspark dataframe?

I'm trying to group by an ID column in a pyspark dataframe and sum a column depending on the value of another column.
To illustrate, consider the following dummy dataframe:
+-----+-------+---------+
| ID| type| amount|
+-----+-------+---------+
| 1| a| 55|
| 2| b| 1455|
| 2| a| 20|
| 2| b| 100|
| 3| null| 230|
+-----+-------+---------+
My desired output is:
+-----+--------+----------+----------+
| ID| sales| sales_a| sales_b|
+-----+--------+----------+----------+
| 1| 55| 55| 0|
| 2| 1575| 20| 1555|
| 3| 230| 0| 0|
+-----+--------+----------+----------+
So basically, sales will be the sum of amount, while sales_a and sales_b are the sum of amount when type is a or b respectively.
For sales, I know this could be done like this:
from pyspark.sql import functions as F
df = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
For the others, I'm guessing F.when would be useful but I'm not sure how to go about it.
You could create two columns before the aggregation, based on the value of type.
from pyspark.sql import functions as F
from pyspark.sql.functions import col

df.withColumn("sales_a", F.when(col("type") == "a", col("amount"))) \
  .withColumn("sales_b", F.when(col("type") == "b", col("amount"))) \
  .groupBy("ID") \
  .agg(F.sum("amount").alias("sales"),
       F.sum("sales_a").alias("sales_a"),
       F.sum("sales_b").alias("sales_b"))
from pyspark.sql import functions as F

dfSales = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
dfPivot = df.filter("type is not null").groupBy("ID").pivot("type").agg(F.sum("amount").alias("sales"))
res = dfSales.join(dfPivot, on="ID", how="left")
Then replace null with 0.
This is a generic solution that works irrespective of the values in the type column, so if a type c is added to the dataframe, it will simply create a corresponding pivoted column.
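For example, a sketch of that final cleanup, assuming the pivot produced columns named a and b:
# Rename the pivoted columns and replace NULLs with 0
# (column names 'a' and 'b' assumed from the pivot above).
res = (res.withColumnRenamed("a", "sales_a")
          .withColumnRenamed("b", "sales_b")
          .fillna(0, subset=["sales_a", "sales_b"]))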

How to rename duplicated columns after join? [duplicate]

This question already has answers here:
How to avoid duplicate columns after join?
(10 answers)
Closed 4 years ago.
I want to join 3 dataframes, but there are some columns we don't need or that have duplicate names with the other dataframes, so I want to drop some columns like below:
result_df = (aa_df.join(bb_df, 'id', 'left')
             .join(cc_df, 'id', 'left')
             .withColumnRenamed(bb_df.status, 'user_status'))
Please note that the status column is in two dataframes, i.e. aa_df and bb_df.
The above doesn't work. I also tried to use withColumn, but the new column is created and the old column still exists.
If you are trying to rename the status column of the bb_df dataframe, then you can do so while joining:
result_df = aa_df.join(bb_df.withColumnRenamed('status', 'user_status'),'id', 'left').join(cc_df, 'id', 'left')
I want to join 3 dataframes, but there are some columns we don't need or that have duplicate names with the other dataframes
That's a fine use case for aliasing a Dataset using alias or as operators.
alias(alias: String): Dataset[T] or alias(alias: Symbol): Dataset[T]
Returns a new Dataset with an alias set. Same as as.
as(alias: String): Dataset[T] or as(alias: Symbol): Dataset[T]
Returns a new Dataset with an alias set.
(And honestly, I only now noticed the Symbol-based variants.)
NOTE There are two as operators, as for aliasing and as for type mapping. Consult the Dataset API.
After you've aliased a Dataset, you can reference columns using the [alias].[columnName] format. This is particularly handy with joins and star column dereferencing using *.
val ds1 = spark.range(5)
scala> ds1.as('one).select($"one.*").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
val ds2 = spark.range(10)
// Using joins with aliased datasets
// where clause is in a longer form to demo how to reference columns by alias
scala> ds1.as('one).join(ds2.as('two)).where($"one.id" === $"two.id").show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
so I want to drop some columns like below
My general recommendation is not to drop columns, but to select what you want to include in the result. That makes life more predictable, as you know what you get (not what you don't). I was told that our brains work by positives, which could also be a point in favour of select.
So, as you asked and I showed in the above example, the result has two columns of the same name id. The question is how to have only one.
There are at least two answers using the variant of the join operator with the join columns or condition included (as you showed in your question), but that would not answer your real question about "dropping unwanted columns", would it?
Given I prefer select (over drop), I'd do the following to have a single id column:
val q = ds1.as('one)
  .join(ds2.as('two))
  .where($"one.id" === $"two.id")
  .select("one.*") // <-- select columns from "one" dataset
scala> q.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
Regardless of the reasons why you asked the question (which could also be answered with the points I raised above), let me answer the (burning) question of how to use withColumnRenamed when there are two matching columns (after a join).
Let's assume you ended up with the following query and so you've got two id columns (one per join side).
val q = ds1.as('one)
  .join(ds2.as('two))
  .where($"one.id" === $"two.id")
scala> q.show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
withColumnRenamed won't work for this use case since it does not accept aliased column names.
scala> q.withColumnRenamed("one.id", "one_id").show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
You could select the columns you're interested in as follows:
scala> q.select("one.id").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
scala> q.select("two.*").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
Please see the docs: withColumnRenamed()
You need to pass the name of the existing column and the new name to the function. Both of these should be strings.
result_df = aa_df.join(bb_df,'id', 'left').join(cc_df, 'id', 'left').withColumnRenamed('status', 'user_status')
If you have a 'status' column in both dataframes, you can include it in the join, as in aa_df.join(bb_df, ['id', 'status'], 'left'), assuming aa_df and bb_df share that column. This way you will not end up with two 'status' columns.
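For example, a quick sketch of that variant (assuming aa_df and bb_df really do share both id and status):
# Join on both shared columns so only one 'status' column survives.
result_df = (aa_df.join(bb_df, ['id', 'status'], 'left')
             .join(cc_df, 'id', 'left'))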

spark concatenate data frames and merge schema

I have several data frames in Spark with a partly similar schema (header) at the beginning and different columns (custom) at the end.
case class First(header1:String, header2:String, header3:Int, custom1:String)
case class Second(header1:String, header2:String, header3:Int, custom1:String, custom5:String)
case class Third(header1:String, header2:String, header3:Int, custom2:String, custom3:Int, custom4:Double)
val first = Seq(First("A", "Ba1", 1, "custom1"), First("A", "Ba2", 2, "custom2")).toDS
val second = Seq(Second("B", "Bb1", 1, "custom12", "custom5"), Second("B", "Bb2", 22, "custom12", "custom55")).toDS
val third = Seq(Third("A", "Bc1", 1, "custom2", 22, 44.4)).toDS
This could look like:
+-------+-------+-------+-------+
|header1|header2|header3|custom1|
+-------+-------+-------+-------+
| A| Ba1| 1|custom1|
| A| Ba2| 2|custom2|
+-------+-------+-------+-------+
+-------+-------+-------+--------+--------+
|header1|header2|header3| custom1| custom5|
+-------+-------+-------+--------+--------+
| B| Bb1| 1|custom12| custom5|
| B| Bb2| 22|custom12|custom55|
+-------+-------+-------+--------+--------+
+-------+-------+-------+-------+-------+-------+
|header1|header2|header3|custom2|custom3|custom4|
+-------+-------+-------+-------+-------+-------+
| A| Bc1| 1|custom2| 22| 44.4|
+-------+-------+-------+-------+-------+-------+
How can I merge the schemas, i.e. basically concatenate all the dataframes into a single schema
case class All(header1: String, header2: String, header3: Int, custom1: Option[String], custom2: Option[String],
               custom3: Option[Int], custom4: Option[Double], custom5: Option[String], `type`: String)
where the columns which are not present in a given frame will be nullable?
Output should look like this for the first record from the data frame named first
+-------+-------+-------+-------+-------+-------+-------+-------+
|header1|header2|header3|custom1|custom2|custom3|custom4|custom5|
+-------+-------+-------+-------+-------+-------+-------+-------+
|      A|    Ba1|      1|custom1|   null|   null|   null|   null|
+-------+-------+-------+-------+-------+-------+-------+-------+
I was thinking about joining the data frames via the header columns; however, only some (let's say header1) would hold the same (actually joinable) values and the others (header2, header3) would hold different values, i.e.
first
  .join(second, Seq("header1", "header2", "header3"), "LEFT")
  .join(third, Seq("header1", "header2", "header3"), "LEFT")
  .show
resulting in
+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|header1|header2|header3|custom1|custom1|custom5|custom2|custom3|custom4|
+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|      A|    Ba1|      1|custom1|   null|   null|   null|   null|   null|
|      A|    Ba2|      2|custom2|   null|   null|   null|   null|   null|
+-------+-------+-------+-------+-------+-------+-------+-------+-------+
is not correct, as I just want to pd.concat(axis=0) the data frames, i.e. I am missing most of the records.
It is also lacking a type column identifying the original data frame, i.e. first, second, third.
edit
I think a classical full outer join is the solution
first
  .join(second, Seq("header1", "header2", "header3"), "fullouter")
  .join(third, Seq("header1", "header2", "header3"), "fullouter")
  .show
yields:
+-------+-------+-------+-------+--------+--------+-------+-------+-------+
|header1|header2|header3|custom1| custom1| custom5|custom2|custom3|custom4|
+-------+-------+-------+-------+--------+--------+-------+-------+-------+
|      A|    Ba1|      1|custom1|    null|    null|   null|   null|   null|
|      A|    Ba2|      2|custom2|    null|    null|   null|   null|   null|
|      B|    Bb1|      1|   null|custom12| custom5|   null|   null|   null|
|      B|    Bb2|     22|   null|custom12|custom55|   null|   null|   null|
|      A|    Bc1|      1|   null|    null|    null|custom2|     22|   44.4|
+-------+-------+-------+-------+--------+--------+-------+-------+-------+
As you can see, there is actually never a real join; the rows are simply concatenated. Is there a simpler operation to achieve the same functionality?
This answer is not optimal, as custom1 is a duplicate name. I would rather see a single custom1 column (with no null values if there is a second value to fill it).
Check out my comment on a similar question. Basically, you need to union all the frames. To get matching schemas you need to add the missing columns with the dataframe.withColumn(columnName, expr("null")) expression:
import org.apache.spark.sql.functions._

val first1 = first.withColumn("custom5", expr("null"))
  .withColumn("custom4", expr("null"))
val second2 = second.withColumn("custom4", expr("null"))
val result = first1.unionAll(second2).unionAll(third)
Please test whether the SQL UNION ALL approach provides the desired result.
SELECT header1,
       header2,
       header3,
       custom1,
       CAST(NULL AS STRING) AS custom2,
       CAST(NULL AS INT)    AS custom3,
       CAST(NULL AS DOUBLE) AS custom4,
       CAST(NULL AS STRING) AS custom5
FROM   table1
UNION ALL
SELECT header1,
       header2,
       header3,
       custom1,
       CAST(NULL AS STRING) AS custom2,
       CAST(NULL AS INT)    AS custom3,
       CAST(NULL AS DOUBLE) AS custom4,
       custom5
FROM   table2
UNION ALL
SELECT header1,
       header2,
       header3,
       CAST(NULL AS STRING) AS custom1,
       custom2,
       custom3,
       custom4,
       CAST(NULL AS STRING) AS custom5
FROM   table3;
If you are writing files to HDFS, then you can achieve this by setting the property spark.sql.parquet.mergeSchema to true and writing the files to the same HDFS location. The schema is merged automatically and all columns are returned when the data is read back.
You can achieve this in the following ways:
withColumn and union
specifying the schema up front and performing a union
spark.conf.set("spark.sql.parquet.mergeSchema", "true")

eb = spark.read.format("csv").schema(schem).option("path", "/retail/ebay.csv").load()
eb.printSchema()
eb.write.format("parquet").mode("append").save("/retail/parquet_test")

from pyspark.sql.functions import lit
eb1 = eb.withColumn("dummy", lit(35))
eb1.printSchema()
eb1.write.format("parquet").mode("append").save("/retail/parquet_test")

eb2 = spark.read.parquet("/retail/parquet_test")
eb2.printSchema()
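Alternatively, schema merging can be requested per read instead of through the global conf; a sketch (path assumed to be the same one written above):
# Per-read schema merging instead of the global setting.
eb2 = spark.read.option("mergeSchema", "true").parquet("/retail/parquet_test")
eb2.printSchema()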
