Pyspark - Creating a dataframe by user defined aggregate function and pivoting

I need to write a user-defined aggregate function that captures the number of days between the previous DISCHARGE_DATE and the following ADMIT_DATE for each pair of consecutive visits.
I will also need to pivot on the "PERSON_ID" values.
I have the following input_df:
+---------+----------+--------------+
|PERSON_ID|ADMIT_DATE|DISCHARGE_DATE|
+---------+----------+--------------+
| 111|2018-03-15| 2018-03-16|
| 333|2018-06-10| 2018-06-11|
| 111|2018-03-01| 2018-03-02|
| 222|2018-12-01| 2018-12-02|
| 222|2018-12-05| 2018-12-06|
| 111|2018-03-30| 2018-03-31|
| 333|2018-06-01| 2018-06-02|
| 333|2018-06-20| 2018-06-21|
| 111|2018-01-01| 2018-01-02|
+---------+----------+--------------+
First, I need to group by each person and sort the corresponding rows by ADMIT_DATE. That would yield "input_df2".
input_df2:
+---------+----------+--------------+
|PERSON_ID|ADMIT_DATE|DISCHARGE_DATE|
+---------+----------+--------------+
| 111|2018-01-01| 2018-01-03|
| 111|2018-03-01| 2018-03-02|
| 111|2018-03-15| 2018-03-16|
| 111|2018-03-30| 2018-03-31|
| 222|2018-12-01| 2018-12-02|
| 222|2018-12-05| 2018-12-06|
| 333|2018-06-01| 2018-06-02|
| 333|2018-06-10| 2018-06-11|
| 333|2018-06-20| 2018-06-21|
+---------+----------+--------------+
The desired output_df :
+------------------+-----------------+-----------------+----------------+
|PERSON_ID_DISTINCT| FIRST_DIFFERENCE|SECOND_DIFFERENCE|THIRD_DIFFERENCE|
+------------------+-----------------+-----------------+----------------+
| 111| 1 month 26 days | 13 days| 14 days|
| 222| 3 days| NAN| NAN|
| 333| 8 days| 9 days| NAN|
+------------------+-----------------+-----------------+----------------+
I know the maximum number of times a person appears in my input_df, so I know how many columns should be created:
input_df.groupBy('PERSON_ID').count().sort('count', ascending=False).show(5)
Thanks a lot in advance.

You can use pyspark.sql.functions.datediff() to compute the difference between two dates in days. In this case, you just need to compute the difference between the current row's ADMIT_DATE and the previous row's DISCHARGE_DATE. You can do this by using pyspark.sql.functions.lag() over a Window.
For example, we can compute the duration between visits in days as a new column DURATION.
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy('PERSON_ID').orderBy('ADMIT_DATE')
input_df.withColumn(
    'DURATION',
    f.datediff(f.col('ADMIT_DATE'), f.lag('DISCHARGE_DATE').over(w))
)\
    .withColumn('INDEX', f.row_number().over(w) - 1)\
    .sort('PERSON_ID', 'INDEX')\
    .show()
#+---------+----------+--------------+--------+-----+
#|PERSON_ID|ADMIT_DATE|DISCHARGE_DATE|DURATION|INDEX|
#+---------+----------+--------------+--------+-----+
#| 111|2018-01-01| 2018-01-02| null| 0|
#| 111|2018-03-01| 2018-03-02| 58| 1|
#| 111|2018-03-15| 2018-03-16| 13| 2|
#| 111|2018-03-30| 2018-03-31| 14| 3|
#| 222|2018-12-01| 2018-12-02| null| 0|
#| 222|2018-12-05| 2018-12-06| 3| 1|
#| 333|2018-06-01| 2018-06-02| null| 0|
#| 333|2018-06-10| 2018-06-11| 8| 1|
#| 333|2018-06-20| 2018-06-21| 9| 2|
#+---------+----------+--------------+--------+-----+
Notice, I also added an INDEX column using pyspark.sql.functions.row_number(). We can just filter for INDEX > 0 (because the first value will always be null) and then just pivot the DataFrame:
input_df.withColumn(
    'DURATION',
    f.datediff(f.col('ADMIT_DATE'), f.lag('DISCHARGE_DATE').over(w))
)\
    .withColumn('INDEX', f.row_number().over(w) - 1)\
    .where('INDEX > 0')\
    .groupBy('PERSON_ID').pivot('INDEX').agg(f.first('DURATION'))\
    .sort('PERSON_ID')\
    .show()
#+---------+---+----+----+
#|PERSON_ID| 1| 2| 3|
#+---------+---+----+----+
#| 111| 58| 13| 14|
#| 222| 3|null|null|
#| 333| 8| 9|null|
#+---------+---+----+----+
Now you can rename the columns to whatever you desire.
Note: This assumes that ADMIT_DATE and DISCHARGE_DATE are of type date.
input_df.printSchema()
#root
# |-- PERSON_ID: long (nullable = true)
# |-- ADMIT_DATE: date (nullable = true)
# |-- DISCHARGE_DATE: date (nullable = true)
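To make the lag-and-pivot logic easier to check by hand, here is a minimal plain-Python sketch (no Spark involved) that reproduces the same numbers from the sample data. The names `visits`, `gaps_per_person` and `pivoted` are made up for this illustration; it only mirrors what the window and pivot do and is not a replacement for the PySpark code above.

```python
from collections import defaultdict
from datetime import date

# Sample visits copied from input_df: (PERSON_ID, ADMIT_DATE, DISCHARGE_DATE)
visits = [
    (111, date(2018, 3, 15), date(2018, 3, 16)),
    (333, date(2018, 6, 10), date(2018, 6, 11)),
    (111, date(2018, 3, 1), date(2018, 3, 2)),
    (222, date(2018, 12, 1), date(2018, 12, 2)),
    (222, date(2018, 12, 5), date(2018, 12, 6)),
    (111, date(2018, 3, 30), date(2018, 3, 31)),
    (333, date(2018, 6, 1), date(2018, 6, 2)),
    (333, date(2018, 6, 20), date(2018, 6, 21)),
    (111, date(2018, 1, 1), date(2018, 1, 2)),
]


def gaps_per_person(rows):
    """Return {person_id: [days between each discharge and the next admit]}."""
    by_person = defaultdict(list)
    for person_id, admit, discharge in rows:
        by_person[person_id].append((admit, discharge))
    result = {}
    for person_id, stays in by_person.items():
        stays.sort()  # order by ADMIT_DATE, like the window's orderBy
        result[person_id] = [
            (stays[i][0] - stays[i - 1][1]).days  # current admit minus previous discharge
            for i in range(1, len(stays))
        ]
    return result


gaps = gaps_per_person(visits)

# "Pivot": pad every person's list to the same width, using None for missing visits
width = max(len(g) for g in gaps.values())
pivoted = {p: g + [None] * (width - len(g)) for p, g in gaps.items()}

for person_id in sorted(pivoted):
    print(person_id, pivoted[person_id])
```

Person 111 ends up with the gaps 58, 13 and 14, matching the pivoted table above.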

Related

Filling gaps in time series Spark for different entities

I have a data frame containing daily events related to various entities in time.
I want to fill the gaps in those times series.
Here is the aggregate data I have (left), and on the right side, the data I want to have:
+---------+----------+-------+ +---------+----------+-------+
|entity_id| date|counter| |entity_id| date|counter|
+---------+----------+-------+ +---------+----------+-------+
| 3|2020-01-01| 7| | 3|2020-01-01| 7|
| 1|2020-01-01| 10| | 1|2020-01-01| 10|
| 2|2020-01-01| 3| | 2|2020-01-01| 3|
| 2|2020-01-02| 9| | 2|2020-01-02| 9|
| 1|2020-01-03| 15| | 1|2020-01-02| 0|
| 2|2020-01-04| 3| | 3|2020-01-02| 0|
| 1|2020-01-04| 14| | 1|2020-01-03| 15|
| 2|2020-01-05| 6| | 2|2020-01-03| 0|
+---------+----------+-------+ | 3|2020-01-03| 0|
| 3|2020-01-04| 0|
| 2|2020-01-04| 3|
| 1|2020-01-04| 14|
| 2|2020-01-05| 6|
| 1|2020-01-05| 0|
| 3|2020-01-05| 0|
+---------+----------+-------+
I have used this stack overflow topic, which was very useful:
Filling gaps in timeseries Spark
Here is my code (filter for only one entity), it is in Python but I think the API is the same in Scala:
import pyspark.sql.functions as sf
a = 60 * 60 * 24  # seconds in one day
(
    df
    .withColumn("date", sf.to_date("created_at"))
    .groupBy(
        sf.col("entity_id"),
        sf.col("date")
    )
    .agg(sf.count(sf.lit(1)).alias("counter"))
    .filter(sf.col("entity_id") == 1)
    .select(
        sf.col("date"),
        sf.col("counter")
    )
    .join(
        spark
        .range(
            df  # range start, rounded down to a whole day
            .filter(sf.col("entity_id") == 1)
            .select(sf.unix_timestamp(sf.min("created_at")).alias("min"))
            .first().min // a * a,
            (df  # range end, rounded up to the next whole day
             .filter(sf.col("entity_id") == 1)
             .select(sf.unix_timestamp(sf.max("created_at")).alias("max"))
             .first().max // a + 1) * a,
            a  # range step: one day
        )
        .select(sf.to_date(sf.from_unixtime("id")).alias("date")),
        ["date"],  # column which will be used for the join
        how="right"  # type of join
    )
    .withColumn("counter", sf.when(sf.isnull("counter"), 0).otherwise(sf.col("counter")))
    .sort(sf.col("date"))
    .show(200)
)
This works very well, but now I want to avoid the filter and build a range that fills the time series gaps for every entity (entity_id == 2, entity_id == 3, ...). Note that the minimum and maximum of the date column can differ depending on the entity_id value; nevertheless, a solution that uses the global minimum and maximum of the whole data frame is fine for me as well.
If you need any other information, feel free to ask.
Edit: added an example of the data I want to have.

When creating the elements of the date range, I would rather use the Pandas function than the Spark range, as the Spark range function has some shortcomings when dealing with date values. The number of distinct dates is usually small: even for a time span of multiple years, there are few enough dates that they can easily be broadcast in a join.
from pyspark.sql import functions as F
import pandas as pd
#get the minimum and maximum date and collect them to the driver
min_date, max_date = df.select(F.min("date"), F.max("date")).first()
#use Pandas to create all dates and switch back to a PySpark DataFrame
timerange = pd.date_range(start=min_date, end=max_date, freq='D')
all_dates = spark.createDataFrame(timerange.to_frame(),['date'])
#get all combinations of dates and entity_ids
all_dates_and_ids = all_dates.crossJoin(df.select("entity_id").distinct())
#create the final result by doing a left join and filling null values with 0
result = all_dates_and_ids.join(df, on=['date', 'entity_id'], how="left_outer")\
    .fillna({'counter': 0})\
    .orderBy(['date', 'entity_id'])
This gives
+-------------------+---------+-------+
| date|entity_id|counter|
+-------------------+---------+-------+
|2020-01-01 00:00:00| 1| 10|
|2020-01-01 00:00:00| 2| 3|
|2020-01-01 00:00:00| 3| 7|
|2020-01-02 00:00:00| 1| 0|
|2020-01-02 00:00:00| 2| 9|
|2020-01-02 00:00:00| 3| 0|
|2020-01-03 00:00:00| 1| 15|
|2020-01-03 00:00:00| 2| 0|
|2020-01-03 00:00:00| 3| 0|
|2020-01-04 00:00:00| 1| 14|
|2020-01-04 00:00:00| 2| 3|
|2020-01-04 00:00:00| 3| 0|
|2020-01-05 00:00:00| 1| 0|
|2020-01-05 00:00:00| 2| 6|
|2020-01-05 00:00:00| 3| 0|
+-------------------+---------+-------+
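The same idea (build every date, cross it with every entity id, then left-join the observed counters and default to 0) can be sketched in plain Python, which may help when checking the expected row count. This is only an illustration on the question's sample data; `observed`, `all_dates`, `entity_ids` and `filled` are names invented for the sketch.

```python
from datetime import date, timedelta
from itertools import product

# Observed (entity_id, date) -> counter, copied from the question's left-hand table
observed = {
    (3, date(2020, 1, 1)): 7,
    (1, date(2020, 1, 1)): 10,
    (2, date(2020, 1, 1)): 3,
    (2, date(2020, 1, 2)): 9,
    (1, date(2020, 1, 3)): 15,
    (2, date(2020, 1, 4)): 3,
    (1, date(2020, 1, 4)): 14,
    (2, date(2020, 1, 5)): 6,
}

# Global minimum and maximum date, then every day in between (inclusive)
min_date = min(d for _, d in observed)
max_date = max(d for _, d in observed)
all_dates = [min_date + timedelta(days=n) for n in range((max_date - min_date).days + 1)]
entity_ids = sorted({e for e, _ in observed})

# Cross product of dates and ids, then a "left join" with 0 as the default counter
filled = [(d, e, observed.get((e, d), 0)) for d, e in product(all_dates, entity_ids)]

for row in filled:
    print(row)
```

With 5 dates and 3 entities this yields 15 rows, the same count as the table above.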

How to do a conditional aggregation after a groupby in pyspark dataframe?

I'm trying to group by an ID column in a pyspark dataframe and sum a column depending on the value of another column.
To illustrate, consider the following dummy dataframe:
+-----+-------+---------+
| ID| type| amount|
+-----+-------+---------+
| 1| a| 55|
| 2| b| 1455|
| 2| a| 20|
| 2| b| 100|
| 3| null| 230|
+-----+-------+---------+
My desired output is:
+-----+--------+----------+----------+
| ID| sales| sales_a| sales_b|
+-----+--------+----------+----------+
| 1| 55| 55| 0|
| 2| 1575| 20| 1555|
| 3| 230| 0| 0|
+-----+--------+----------+----------+
So basically, sales will be the sum of amount, while sales_a and sales_b are the sum of amount when type is a or b respectively.
For sales, I know this could be done like this:
from pyspark.sql import functions as F
df = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
For the others, I'm guessing F.when would be useful but I'm not sure how to go about it.

You could create two columns before the aggregation, based on the value of type. Adding otherwise(0) makes IDs without a matching type sum to 0 instead of null, as in the desired output.
df.withColumn("sales_a", F.when(F.col("type") == "a", F.col("amount")).otherwise(0)) \
    .withColumn("sales_b", F.when(F.col("type") == "b", F.col("amount")).otherwise(0)) \
    .groupBy("ID") \
    .agg(F.sum("amount").alias("sales"),
         F.sum("sales_a").alias("sales_a"),
         F.sum("sales_b").alias("sales_b"))
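As a sanity check of the expected numbers, the conditional sums can be reproduced in a few lines of plain Python on the dummy data from the question. This is a sketch of the logic only, not PySpark code; `rows` and `totals` are illustrative names.

```python
from collections import defaultdict

# (ID, type, amount) rows copied from the question's dummy dataframe
rows = [(1, "a", 55), (2, "b", 1455), (2, "a", 20), (2, "b", 100), (3, None, 230)]

# One accumulator per ID; every conditional sum starts at 0
totals = defaultdict(lambda: {"sales": 0, "sales_a": 0, "sales_b": 0})
for id_, type_, amount in rows:
    totals[id_]["sales"] += amount  # unconditional sum
    if type_ in ("a", "b"):
        totals[id_]["sales_" + type_] += amount  # sum only where the type matches

for id_ in sorted(totals):
    print(id_, totals[id_])
```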

from pyspark.sql import functions as F
# keep the original df intact so it can still be pivoted on type
dfTotal = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
dfPivot = df.filter("type is not null").groupBy("ID").pivot("type").agg(F.sum("amount"))
res = dfTotal.join(dfPivot, on="ID", how="left")
Then replace null with 0.
This is a generic solution that works irrespective of the values in the type column, so if a type c is added to the dataframe, a column for c is created as well.

Crossjoin between two dataframes that is dependent on a common column

A crossJoin can be done as follows:
import numpy as np
import pandas as pd
from datetime import date, timedelta
date_today = date.today()  # assumption: date_today holds today's date
df1 = pd.DataFrame({'subgroup':['A','B','C','D']})
df2 = pd.DataFrame({'dates':pd.date_range(date_today, date_today + timedelta(3), freq='D')})
sdf1 = spark.createDataFrame(df1)
sdf2 = spark.createDataFrame(df2)
sdf1.crossJoin(sdf2).toPandas()
In this example there are two dataframes, each containing 4 rows; in the end, I get 16 rows.
However, for my problem, I would like to do a cross join per user, and the user is another column in the two dataframes, e.g.:
df1 = pd.DataFrame({'user':[1,1,1,1,2,2,2,2],'subgroup':['A','B','C','D','A','B','D','E']})
df2 = pd.DataFrame({'user':[1,1,1,1,2,2,2,2],'dates':np.hstack([np.array(pd.date_range(date_today, date_today + timedelta(3), freq='D')),np.array(pd.date_range(date_today+timedelta(1), date_today + timedelta(4), freq='D'))])})
The result of applying the per-user crossJoin should be a dataframe with 32 rows. Is this possible in pyspark and how can this be done?

A cross join is a join that multiplies rows because the joining key does not identify rows uniquely (in our case the joining key is trivial, or there is no joining key at all).
Let's start with sample data frames:
import numpy as np
import pyspark.sql.functions as psf
import pyspark.sql.types as pst
df1 = spark.createDataFrame(
[[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value1']]))
df2 = spark.createDataFrame(
[[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value2']]))
+----+------+
|user|value1|
+----+------+
| 0| 76|
| 1| 59|
| 0| 14|
| 1| 71|
| 0| 66|
| 1| 61|
| 0| 2|
| 1| 22|
| 0| 16|
| 1| 83|
+----+------+
+----+------+
|user|value2|
+----+------+
| 0| 65|
| 1| 81|
| 0| 60|
| 1| 69|
| 0| 21|
| 1| 61|
| 0| 98|
| 1| 76|
| 0| 40|
| 1| 21|
+----+------+
Let's try joining the data frames on a constant column to see the equivalence between a cross join and regular join on a constant (trivial) column:
df = df1.withColumn('key', psf.lit(1)) \
.join(df2.withColumn('key', psf.lit(1)), on=['key'])
In Spark 2.x this raises an error, because Spark realises we're trying to do a cross join (Cartesian product):
Py4JJavaError: An error occurred while calling o1865.showString.
: org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
LogicalRDD [user#1538, value1#1539], false
and
LogicalRDD [user#1542, value2#1543], false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
If your joining key (user here) is not a column that uniquely identifies rows, you'll get a multiplication of rows as well, but only within each user group:
df = df1.join(df2, on='user')
print("Number of rows : \tdf1: {} \tdf2: {} \tdf: {}".format(df1.count(), df2.count(), df.count()))
Number of rows : df1: 10 df2: 10 df: 50
+----+------+------+
|user|value1|value2|
+----+------+------+
| 1| 59| 81|
| 1| 59| 69|
| 1| 59| 61|
| 1| 59| 76|
| 1| 59| 21|
| 1| 71| 81|
| 1| 71| 69|
| 1| 71| 61|
| 1| 71| 76|
| 1| 71| 21|
| 1| 61| 81|
| 1| 61| 69|
| 1| 61| 61|
| 1| 61| 76|
| 1| 61| 21|
| 1| 22| 81|
| 1| 22| 69|
| 1| 22| 61|
| 1| 22| 76|
| 1| 22| 21|
+----+------+------+
5 * 5 rows for user 0 + 5 * 5 rows for user 1, hence 50
Note: Using a self join followed by a filter usually means you should be using window functions instead.
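Applied to the data in the question, the same reasoning gives the 32 rows that were asked for. The sketch below checks this in plain Python rather than Spark; integer day offsets stand in for the actual dates, and `subgroups`, `days` and `per_user` are names made up for the illustration.

```python
from itertools import product

# (user, subgroup) and (user, day offset) pairs mirroring the question's df1 and df2;
# integer day offsets stand in for the actual dates.
subgroups = [(1, s) for s in "ABCD"] + [(2, s) for s in "ABDE"]
days = [(1, d) for d in range(0, 4)] + [(2, d) for d in range(1, 5)]

# Joining on a non-unique key pairs every left row with every right row of the same user
per_user = [
    (u1, s, d)
    for (u1, s), (u2, d) in product(subgroups, days)
    if u1 == u2
]

print(len(per_user))  # 4 * 4 rows for user 1 + 4 * 4 rows for user 2
```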

Keep track of the previous row values with additional condition using pyspark

I'm using pyspark to generate a dataframe in which I need to replace the 'amt' column with the previous row's 'amt' value, but only when amt = 0.
For example, below is my dataframe
+---+-----+
| id|amt |
+---+-----+
| 1| 5|
| 2| 0|
| 3| 0|
| 4| 6|
| 5| 0|
| 6| 3|
+---+-----+
Now I want the following DF to be created: whenever amt = 0, the modi_amt column should contain the previous non-zero value; otherwise it stays unchanged.
+---+-----+----------+
| id|amt |modi_amt |
+---+-----+----------+
| 1| 5| 5|
| 2| 0| 5|
| 3| 0| 5|
| 4| 6| 6|
| 5| 0| 6|
| 6| 3| 3|
+---+-----+----------+
I'm able to get the previous row's value, but I need help with the rows where several consecutive amt values are 0 (for example, id = 2, 3).
The code I'm using:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
my_window = Window.partitionBy().orderBy("id")
DF = DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))
DF = DF.withColumn("modi_amt", F.when(DF.amt == 0, DF.prev_amt).otherwise(DF.amt)).drop('prev_amt')
I'm getting the below DF
+---+-----+----------+
| id|amt |modi_amt |
+---+-----+----------+
| 1| 5| 5|
| 2| 0| 5|
| 3| 0| 0|
| 4| 6| 6|
| 5| 0| 6|
| 6| 3| 3|
+---+-----+----------+
Basically, id 3 should also have modi_amt = 5.

I've used the approach below to get the output, and it's working fine:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
my_window = Window.partitionBy().orderBy("id")
# this will hold the previous row's amt value
DF = DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))
# this replaces an amt of 0 with the previous row's value; consecutive rows with amt 0 are not handled yet
DF = DF.withColumn("amt_adjusted", F.when(DF.amt == 0, DF.prev_amt).otherwise(DF.amt))
# set null for the rows where both amt and amt_adjusted are 0 (the consecutive-zero case)
DF = DF.withColumn('zeroNonZero', F.when((DF.amt == 0) & (DF.amt_adjusted == 0), F.lit(None)).otherwise(DF.amt_adjusted))
# replace all null values with the previous non-zero amt value
DF = DF.withColumn('modi_amt', F.last("zeroNonZero", ignorenulls=True).over(Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)))
Is there a better approach?
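The last step above is essentially a forward fill: carry the most recent non-zero value down through any run of zeros. A minimal plain-Python sketch of that logic, using the question's sample values, might look as follows. The function name `carry_forward_nonzero` is made up for this illustration, and rows before the first non-zero value are left as None, which is an assumption about the desired behaviour.

```python
def carry_forward_nonzero(amounts):
    """Replace each 0 with the most recent non-zero value seen before it."""
    result = []
    last_nonzero = None  # stays None until a non-zero value appears
    for amt in amounts:
        if amt != 0:
            last_nonzero = amt
        result.append(last_nonzero)
    return result


# amt column from the question, ordered by id
modi_amt = carry_forward_nonzero([5, 0, 0, 6, 0, 3])
print(modi_amt)
```

Because only a single running value is tracked, runs of any length are handled the same way, including the id = 2, 3 case from the question.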

Getting the least set of rows in a groupby of a pyspark dataframe [duplicate]

This question already has answers here:
GroupBy column and filter rows with maximum value in Pyspark
(4 answers)
Closed 4 years ago.
I have a dataframe with values
#+-------+---------+-----+
#|name1 |name 2 |score|
#+-------+---------+-----+
#| abcdef| abcghi | 3|
#| abcdef| abcjkl | 3|
#| abcdef| abcyui | 3|
#| abcdef| abrtyu | 4|
#| pqrstu| pqrswe | 2|
#| pqrstu| pqrsqw | 2|
#| pqrstu| pqrzxc | 3|
#+-------+---------+-----+
I need to group by name1 and pick the rows with the least score.
I understand that I can group by name1, sort by score in ascending order, and pick the first row of each group. I do this with:
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number
joined_windows = Window.partitionBy("name1").orderBy(col("score").asc())
result = joined_df.withColumn("rn", row_number().over(joined_windows)).where(col("rn") == 1).drop("rn")
But I want the dataframe to hold the following values (i.e., the set of rows with the least score in each group):
#+-------+---------+-----+
#|name1 |name 2 |score|
#+-------+---------+-----+
#| abcdef| abcghi | 3|
#| abcdef| abcjkl | 3|
#| abcdef| abcyui | 3|
#| pqrstu| pqrswe | 2|
#| pqrstu| pqrsqw | 2|
#+-------+---------+-----+

To keep all rows that share the lowest score in each group, code like this (Scala) can be used:
val joined_windows = Window.partitionBy("name1")
val result = df.withColumn("rn", min($"score").over(joined_windows))
result.where($"rn"===$"score").drop("rn").show(false)
Output:
+------+------+-----+
|name1 |name 2|score|
+------+------+-----+
|abcdef|abcghi|3 |
|abcdef|abcjkl|3 |
|abcdef|abcyui|3 |
|pqrstu|pqrswe|2 |
|pqrstu|pqrsqw|2 |
+------+------+-----+
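The logic of this answer is to compute the minimum score per name1 and keep every row whose score equals that minimum. A plain-Python sketch on the question's sample rows is shown below; it is not Spark code, and `rows`, `min_score` and `result` are illustrative names.

```python
# (name1, name 2, score) rows copied from the question
rows = [
    ("abcdef", "abcghi", 3),
    ("abcdef", "abcjkl", 3),
    ("abcdef", "abcyui", 3),
    ("abcdef", "abrtyu", 4),
    ("pqrstu", "pqrswe", 2),
    ("pqrstu", "pqrsqw", 2),
    ("pqrstu", "pqrzxc", 3),
]

# First pass: minimum score per name1 (the role of min(...).over(window))
min_score = {}
for name1, _, score in rows:
    min_score[name1] = min(score, min_score.get(name1, score))

# Second pass: keep every row that ties for the minimum of its group
result = [row for row in rows if row[2] == min_score[row[0]]]

for row in result:
    print(row)
```

Unlike the row_number approach, this keeps all tied rows instead of one arbitrary row per group.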

You can group by two columns:
df \
.groupBy('name1', 'name2') \
.agg(F.min('score'))
