How to mark overlapping time ranges in a PySpark dataframe? - apache-spark

I want to mark rows whose start and end times overlap with another row's, grouped by key.
For example, if given a dataframe like:
+---+-------------------+-------------------+
|key|start_date |end_date |
+---+-------------------+-------------------+
|A |2022-01-11 00:00:00|8888-12-31 00:00:00|
|B |2020-01-01 00:00:00|2022-02-10 00:00:00|
|B |2019-02-08 00:00:00|2020-02-15 00:00:00|
|B |2022-02-16 00:00:00|2022-12-15 00:00:00|
|C |2018-01-01 00:00:00|2122-02-10 00:00:00|
+---+-------------------+-------------------+
the resulting dataframe would have the first and second B records flagged as invalid, since their time ranges overlap, like this:
+---+-------------------+-------------------+-----+
|key|start_date |end_date |valid|
+---+-------------------+-------------------+-----+
|A |2022-01-11 00:00:00|8888-12-31 00:00:00|true |
|B |2020-01-01 00:00:00|2022-02-10 00:00:00|false|
|B |2019-02-08 00:00:00|2020-02-15 00:00:00|false|
|B |2022-02-16 00:00:00|2022-12-15 00:00:00|true |
|C |2018-01-01 00:00:00|2122-02-10 00:00:00|true |
+---+-------------------+-------------------+-----+

Here I've added scripts that combine overlapping date ranges. For your case, I modified the last script slightly: instead of a final groupBy over the overlapping ranges, I added a window function that just flags them.
Test input:
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
    [('A', '2022-01-11 00:00:00', '8888-12-31 00:00:00'),
     ('B', '2020-01-01 00:00:00', '2022-02-10 00:00:00'),
     ('B', '2019-02-08 00:00:00', '2020-02-15 00:00:00'),
     ('B', '2022-02-16 00:00:00', '2022-12-15 00:00:00'),
     ('C', '2018-01-01 00:00:00', '2122-02-10 00:00:00')],
    ['key', 'start_date', 'end_date'])
Script:
w1 = W.partitionBy("key").orderBy("start_date")
w2 = W.partitionBy("key", "contiguous_grp")

# running maximum of end_date within each key, ordered by start_date
max_end = F.max("end_date").over(w1)
# start a new group whenever the current row starts after the previous running max end
contiguous = F.when(F.datediff(F.lag(max_end).over(w1), "start_date") < 0, 1).otherwise(0)

df = (df
    .withColumn("contiguous_grp", F.sum(contiguous).over(w1))
    # a group that contains exactly one row has no overlap
    .withColumn("valid", (F.count(F.lit(1)).over(w2)) == 1)
    .drop("contiguous_grp")
)
df.show()
# +---+-------------------+-------------------+-----+
# |key| start_date| end_date|valid|
# +---+-------------------+-------------------+-----+
# | A|2022-01-11 00:00:00|8888-12-31 00:00:00| true|
# | B|2019-02-08 00:00:00|2020-02-15 00:00:00|false|
# | B|2020-01-01 00:00:00|2022-02-10 00:00:00|false|
# | B|2022-02-16 00:00:00|2022-12-15 00:00:00| true|
# | C|2018-01-01 00:00:00|2122-02-10 00:00:00| true|
# +---+-------------------+-------------------+-----+
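For comparison only (this is my own sketch, not part of the answer above), a more direct but typically heavier way to flag overlaps is a self-join on key with the classic interval-overlap condition:
from pyspark.sql import functions as F

# recreate the test input; string comparison works because the timestamps are zero-padded
src = spark.createDataFrame(
    [('A', '2022-01-11 00:00:00', '8888-12-31 00:00:00'),
     ('B', '2020-01-01 00:00:00', '2022-02-10 00:00:00'),
     ('B', '2019-02-08 00:00:00', '2020-02-15 00:00:00'),
     ('B', '2022-02-16 00:00:00', '2022-12-15 00:00:00'),
     ('C', '2018-01-01 00:00:00', '2122-02-10 00:00:00')],
    ['key', 'start_date', 'end_date'])

other = src.select(
    'key',
    F.col('start_date').alias('other_start'),
    F.col('end_date').alias('other_end'))

overlapping = (src.join(other, 'key')
    # two intervals overlap when each one starts before the other ends ...
    .where((F.col('start_date') <= F.col('other_end'))
           & (F.col('end_date') >= F.col('other_start'))
           # ... excluding the row matched against itself
           # (assumes (key, start_date, end_date) uniquely identifies a row)
           & ~((F.col('start_date') == F.col('other_start'))
               & (F.col('end_date') == F.col('other_end'))))
    .select('key', 'start_date', 'end_date')
    .distinct()
    .withColumn('valid', F.lit(False)))

(src.join(overlapping, ['key', 'start_date', 'end_date'], 'left')
    .withColumn('valid', F.coalesce(F.col('valid'), F.lit(True)))
    .show())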

Another option collects all the intervals per key and compares each row against them:
from pyspark.sql import Window
from pyspark.sql.functions import to_date, collect_list, array, col, expr

df = (df
    # coerce the string columns to dates
    .select('key', *[to_date(x).alias(x) for x in df.columns if x != 'key'])
    # build the list of [start_date, end_date] intervals per key
    .withColumn('valid', collect_list(array(col('start_date'), col('end_date'))).over(Window.partitionBy('key')))
    # flag the row if its start date falls before the start of another interval in the list
    .withColumn('valid', expr("array_contains(transform(valid, (x, i) -> start_date < x[0]), true)"))
)
+---+----------+----------+-----+
|key|start_date|end_date |valid|
+---+----------+----------+-----+
|A |2022-01-11|8888-12-31|false|
|B |2020-01-01|2022-02-10|true |
|B |2019-02-08|2020-02-15|true |
|B |2022-02-16|2022-12-15|false|
|C |2018-01-01|2122-02-10|false|
+---+----------+----------+-----+
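Note that the valid flag here is inverted relative to the question's expected output: true marks a row whose start date precedes another interval's start within the same key. If you want the same semantics as the first answer (true = no overlap), negate it:
df = df.withColumn('valid', ~col('valid'))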

Related

Using PySpark, how to determine the frequency of each event and its event-by-event frequency

I have a dataset like:
Data
a
a
a
a
a
b
b
b
a
a
b
I would like to add a column like the one below. The values take the form a1,1, where the first part (a1) is the event frequency, i.e. which run of "a" this is in the field, and the second part (,1) is the frequency within that event, i.e. how many times "a" has repeated so far before any other element (b) appears. Can we do this with PySpark?
Data Frequency
a a1,1
a a1,2
a a1,3
a a1,4
a a1,5
b b1,1
b b1,2
b b1,3
a a2,1
a a2,2
b b2,1
You can achieve your desired result by doing this,
from pyspark.sql import Window
import pyspark.sql.functions as F
df = spark.createDataFrame(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b'], 'string').toDF("Data")
print("Original Data:")
df.show()
print("Result:")
df.withColumn("ID", F.monotonically_increasing_id()) \
.withColumn("group",
F.row_number().over(Window.orderBy("ID"))
- F.row_number().over(Window.partitionBy("Data").orderBy("ID"))
) \
.withColumn("element_freq", F.when(F.col('Data') != 'abcd', F.row_number().over(Window.partitionBy("group").orderBy("ID"))).otherwise(F.lit(0)))\
.withColumn("event_freq", F.when(F.col('Data') != 'abcd', F.dense_rank().over(Window.partitionBy("Data").orderBy("group"))).otherwise(F.lit(0)))\
.withColumn("Frequency", F.concat_ws(',', F.concat(F.col("Data"), F.col("event_freq")), F.col("element_freq"))) \
.orderBy("ID")\
.drop("ID", "group", "event_freq", "element_freq")\
.show()
Original Data:
+----+
|Data|
+----+
| a|
| a|
| a|
| a|
| a|
| b|
| b|
| b|
| a|
| a|
| b|
+----+
Result:
+----+---------+
|Data|Frequency|
+----+---------+
| a| a1,1|
| a| a1,2|
| a| a1,3|
| a| a1,4|
| a| a1,5|
| b| b1,1|
| b| b1,2|
| b| b1,3|
| a| a2,1|
| a| a2,2|
| b| b2,1|
+----+---------+
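To see why subtracting the per-value row_number from the global row_number isolates consecutive runs, it can help to inspect the intermediate columns before they are dropped. A hypothetical debug view (the column names global_rn and per_value_rn are mine):
from pyspark.sql import Window
import pyspark.sql.functions as F

debug = (df.withColumn("ID", F.monotonically_increasing_id())
           # position of the row in the whole dataframe (single-partition window, fine for a small debug)
           .withColumn("global_rn", F.row_number().over(Window.orderBy("ID")))
           # position of the row among rows with the same Data value
           .withColumn("per_value_rn", F.row_number().over(Window.partitionBy("Data").orderBy("ID")))
           # constant within each consecutive run of the same value
           .withColumn("group", F.col("global_rn") - F.col("per_value_rn")))
debug.orderBy("ID").show()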
Use Window functions. I give you two options, just in case.
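Both options below assume a dataframe df1 holding the question's Data column and that the needed functions are imported unqualified; a hypothetical setup to make them runnable:
from pyspark.sql import Window
# note: importing sum/rank etc. unqualified shadows the Python builtins, but matches the code below
from pyspark.sql.functions import (
    monotonically_increasing_id, lag, when, col, concat, sum, rank, array, array_join)

df1 = spark.createDataFrame(
    [('a',), ('a',), ('a',), ('a',), ('a',), ('b',), ('b',), ('b',), ('a',), ('a',), ('b',)],
    ['Data'])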
Option 1, separating groups and Frequency
# Window ordered by the index column created below
k = Window.partitionBy().orderBy('index')
(
    # Create an index of df1 to order by
    df1.withColumn('index', monotonically_increasing_id())
    # Put the previous row's Data next to the current row's Data
    .withColumn('group', lag('Data').over(k))
    # Where current and previous don't match, conditionally assign a 1, else 0
    .withColumn('group', when(col('Data') != col('group'), 1).otherwise(0))
    # Concat Data with the running sum of those flags per Data value, ordered by index
    .withColumn('group', concat('Data', sum('group').over(Window.partitionBy('Data').orderBy('index')) + 1))
    # Rank rows within each group in the order they appeared in the initial df
    .withColumn('Frequency', rank().over(Window.partitionBy('group').orderBy('index')))
).sort('index').drop('index').show(truncate=False)
+----+-----+---------+
|Data|group|Frequency|
+----+-----+---------+
|a |a1 |1 |
|a |a1 |2 |
|a |a1 |3 |
|a |a1 |4 |
|a |a1 |5 |
|b |b2 |1 |
|b |b2 |2 |
|b |b2 |3 |
|a |a2 |1 |
|a |a2 |2 |
|b |b3 |1 |
+----+-----+---------+
Option 2 concatenates the group label and the rank into a single Frequency column, which is the format you wanted (note the group numbering comes out as b2/b3 rather than b1/b2):
# Window ordered by the index column created below
k = Window.partitionBy().orderBy('index')
(
    # Create an index of df1 to order by
    df1.withColumn('index', monotonically_increasing_id())
    # Put the previous row's Data next to the current row's Data
    .withColumn('Frequency', lag('Data').over(k))
    # Where current and previous don't match, conditionally assign a 1, else 0
    .withColumn('Frequency', when(col('Data') != col('Frequency'), 1).otherwise(0))
    # Concat Data with the running sum of those flags per Data value, ordered by index
    .withColumn('Frequency', concat('Data', sum('Frequency').over(Window.partitionBy('Data').orderBy('index')) + 1))
    # Join the group label and the rank (order of appearance within the group) with a comma
    .withColumn('Frequency', array_join(array('Frequency', rank().over(Window.partitionBy('Frequency').orderBy('index'))), ','))
).sort('index').drop('index').show(truncate=False)
+----+---------+
|Data|Frequency|
+----+---------+
|a |a1,1 |
|a |a1,2 |
|a |a1,3 |
|a |a1,4 |
|a |a1,5 |
|b |b2,1 |
|b |b2,2 |
|b |b2,3 |
|a |a2,1 |
|a |a2,2 |
|b |b3,1 |
+----+---------+

What is the best way to fill missing info in all columns with null/0 for missing records in a Spark dataframe while grouping?

Let's say I have the following Spark frame:
+--------+----------+-----------+-------------------+-------------------+
|UserName|date |NoLogPerDay|NoLogPer-1st-12-hrs|NoLogPer-2nd-12-hrs|
+--------+----------+-----------+-------------------+-------------------+
|B |2021-08-11|2 |2 |0 |
|A |2021-08-11|3 |2 |1 |
|B |2021-08-13|1 |1 |0 |
+--------+----------+-----------+-------------------+-------------------+
Now I want to not only fill in the missing dates in the date column, so that the dataframe keeps its continuous, equally spaced time-series nature, but also fill the other columns with null or 0 (preferably while grouping).
My code is below:
import time
import datetime as dt
from pyspark.sql import functions as F
from pyspark.sql.functions import *
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, TimestampType, DateType
dict2 = [("2021-08-11 04:05:06", "A"),
         ("2021-08-11 04:15:06", "B"),
         ("2021-08-11 09:15:26", "A"),
         ("2021-08-11 11:04:06", "B"),
         ("2021-08-11 14:55:16", "A"),
         ("2021-08-13 04:12:11", "B"),
         ]
schema = StructType([
    StructField("timestamp", StringType(), True),
    StructField("UserName", StringType(), True),
])
#create a Spark dataframe
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(data=dict2, schema=schema)
#sdf.printSchema()
#sdf.show(truncate=False)
#+-------------------+--------+
#|timestamp |UserName|
#+-------------------+--------+
#|2021-08-11 04:05:06|A |
#|2021-08-11 04:15:06|B |
#|2021-08-11 09:15:26|A |
#|2021-08-11 11:04:06|B |
#|2021-08-11 14:55:16|A |
#|2021-08-13 04:12:11|B |
#+-------------------+--------+
#Generate date and timestamp
sdf1 = sdf.withColumn('timestamp', F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss").cast(TimestampType())) \
          .withColumn('date', F.to_date("timestamp", "yyyy-MM-dd").cast(DateType())) \
          .select('timestamp', 'date', 'UserName')
#sdf1.show(truncate = False)
#+-------------------+----------+--------+
#|timestamp |date |UserName|
#+-------------------+----------+--------+
#|2021-08-11 04:05:06|2021-08-11|A |
#|2021-08-11 04:15:06|2021-08-11|B |
#|2021-08-11 09:15:26|2021-08-11|A |
#|2021-08-11 11:04:06|2021-08-11|B |
#|2021-08-11 14:55:16|2021-08-11|A |
#|2021-08-13 04:12:11|2021-08-13|B |
#+-------------------+----------+--------+
#Aggregate record counts for a specific feature (UserName) at certain time resolutions: PerDay (24 hrs), HalfDay (2 x 12 hrs)
df = sdf1.groupBy("UserName", "date").agg(
    F.sum(F.hour("timestamp").between(0, 24).cast("int")).alias("NoLogPerDay"),
    F.sum(F.hour("timestamp").between(0, 11).cast("int")).alias("NoLogPer-1st-12-hrs"),
    F.sum(F.hour("timestamp").between(12, 23).cast("int")).alias("NoLogPer-2nd-12-hrs"),
).sort('date')
df.show(truncate=False)
The problem is that when I groupBy on date and UserName, I miss dates on which user B had activity but user A did not, or vice versa. So I'm interested in reflecting these no-activity days in the Spark dataframe by filling in those dates (no timestamp needed) and allocating 0 to those columns. I'm not sure whether I can do this while grouping, or whether it has to happen before or after!
I already checked some related posts as well as the window functions PySpark offers, and, inspired by this answer, this is what I've tried so far:
# compute the list of all dates from available dates
max_date = sdf1.select(F.max('date')).first()['max(date)']
min_date = sdf1.select(F.min('date')).first()['min(date)']
print(min_date) #2021-08-11
print(max_date) #2021-08-13
#compute the full list of dates between min_date & max_date from the available data
dates_list = [max_date - dt.timedelta(days=x) for x in range((max_date - min_date).days +1)]
print(dates_list)
#create a temporary Spark dataframe for the date column, including the missing dates at a 1-day interval
sqlCtx = SQLContext(sc)
df2 = sqlCtx.createDataFrame(data=dates_list)
#Apply leftouter join on date column
dff = df2.join(sdf1, ["date"], "leftouter")
#dff.sort('date').show(truncate = False)
#possible to use .withColumn().otherwise()
#.withColumn('date',when(col('date').isNull(),to_date(lit('01.01.1900'),'dd.MM.yyyy')).otherwise(col('date')))
#Replace null with 0 for all integer columns
dfff = dff.na.fill(value=0).sort('date')
dfff.select('date','Username', 'NoLogPerDay','NoLogPer-1st-12-hrs','NoLogPer-2nd-12-hrs').sort('date').show(truncate = False)
Please note that I'm not interested in using a UDF or hacking it via toPandas().
The expected results after the groupBy should look like this:
+--------+----------+-----------+-------------------+-------------------+
|UserName|date |NoLogPerDay|NoLogPer-1st-12-hrs|NoLogPer-2nd-12-hrs|
+--------+----------+-----------+-------------------+-------------------+
|B |2021-08-11|2 |2 |0 |
|A |2021-08-11|3 |2 |1 |
|B |2021-08-12|0 |0 |0 | <--
|A |2021-08-12|0 |0 |0 | <--
|B |2021-08-13|1 |1 |0 |
|A |2021-08-13|0 |0 |0 | <--
+--------+----------+-----------+-------------------+-------------------+
Here's one way of doing it:
First, generate a new dataframe all_dates_df that contains the sequence of dates from the min to the max date in your grouped df. For this you can use the sequence function:
import pyspark.sql.functions as F
all_dates_df = df.selectExpr(
    "sequence(min(date), max(date), interval 1 day) as date"
).select(F.explode("date").alias("date"))
all_dates_df.show()
#+----------+
#| date|
#+----------+
#|2021-08-11|
#|2021-08-12|
#|2021-08-13|
#+----------+
Now, you need to duplicate each date for all the users using a cross join with the distinct UserName dataframe, and finally join with the grouped df to get the desired output:
result_df = all_dates_df.crossJoin(
    df.select("UserName").distinct()
).join(
    df,
    ["UserName", "date"],
    "left"
).fillna(0)
result_df.show()
#+--------+----------+-----------+-------------------+-------------------+
#|UserName| date|NoLogPerDay|NoLogPer-1st-12-hrs|NoLogPer-2nd-12-hrs|
#+--------+----------+-----------+-------------------+-------------------+
#| A|2021-08-11| 3| 2| 1|
#| B|2021-08-11| 2| 2| 0|
#| A|2021-08-12| 0| 0| 0|
#| B|2021-08-12| 0| 0| 0|
#| B|2021-08-13| 1| 1| 0|
#| A|2021-08-13| 0| 0| 0|
#+--------+----------+-----------+-------------------+-------------------+
Essentially, you may generate all the possible (UserName, date) combinations and left join your grouped data onto them to fill in the missing dates.
The sequence SQL function may be helpful here to generate all your possible dates. You pass it your min and max dates along with the interval you would like it to increment by. The following examples continue from the code in your Google Colab.
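As a standalone illustration of sequence (my own snippet, not tied to the answer's dataframe):
spark.sql(
    "SELECT explode(sequence(to_date('2021-08-11'), to_date('2021-08-13'), interval 1 day)) AS date"
).show()
# +----------+
# |      date|
# +----------+
# |2021-08-11|
# |2021-08-12|
# |2021-08-13|
# +----------+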
Using the functions min, max, collect_set and the table-generating function explode, you may achieve the following:
possible_user_dates = (
    # Step 1 - Get all possible UserNames and desired dates
    df.select(
        F.collect_set("UserName").alias("UserName"),
        F.expr("sequence(min(date), max(date), interval 1 day)").alias("date")
    )
    # Step 2 - Use explode to split the collected arrays into rows (output immediately below)
    .withColumn("UserName", F.explode("UserName"))
    .withColumn("date", F.explode("date"))
    .distinct()
)
possible_user_dates.show(truncate=False)
+--------+----------+
|UserName|date |
+--------+----------+
|B |2021-08-11|
|A |2021-08-11|
|B |2021-08-12|
|A |2021-08-12|
|B |2021-08-13|
|A |2021-08-13|
+--------+----------+
Performing your left join
final_df = (
    possible_user_dates.join(
        df,
        ["UserName", "date"],
        "left"
    )
    # The left join places NULLs where values are missing,
    # e.g. where a user was not active on a particular date.
    # We use `fill` to replace those null values with `0`.
    .na.fill(0)
)
final_df.show(truncate=False)
+--------+----------+-----------+-------------------+-------------------+
|UserName|date |NoLogPerDay|NoLogPer-1st-12-hrs|NoLogPer-2nd-12-hrs|
+--------+----------+-----------+-------------------+-------------------+
|B |2021-08-11|2 |2 |0 |
|A |2021-08-11|3 |2 |1 |
|B |2021-08-12|0 |0 |0 |
|A |2021-08-12|0 |0 |0 |
|B |2021-08-13|1 |1 |0 |
|A |2021-08-13|0 |0 |0 |
+--------+----------+-----------+-------------------+-------------------+
For debugging purposes, I've included the output of a few intermediary steps
Step 1 Output:
df.select(
    F.collect_set("UserName").alias("UserName"),
    F.expr("sequence(min(date), max(date), interval 1 day)").alias("date")
).show(truncate=False)
+--------+------------------------------------+
|UserName|date |
+--------+------------------------------------+
|[B, A] |[2021-08-11, 2021-08-12, 2021-08-13]|
+--------+------------------------------------+

How to remove NULL from a struct field in pyspark?

I have a DataFrame which contains one struct field. I want to remove the values which are null from the struct field.
temp_df_struct = Df.withColumn("VIN_COUNTRY_CD",struct('BXSR_VEHICLE_1_VIN_COUNTRY_CD','BXSR_VEHICLE_2_VIN_COUNTRY_CD','BXSR_VEHICLE_3_VIN_COUNTRY_CD','BXSR_VEHICLE_4_VIN_COUNTRY_CD','BXSR_VEHICLE_5_VIN_COUNTRY_CD'))
Some of these columns contain NULLs. Is there any way to remove the nulls from the struct field?
You should always provide a small reproducible example, but here's my guess at what you want.
Example data
data = [("1", "10", "20", None, "30", "40"), ("2", None, "15", "25", "35", None)]
names_of_cols = [
    "id",
    "BXSR_VEHICLE_1_VIN_COUNTRY_CD",
    "BXSR_VEHICLE_2_VIN_COUNTRY_CD",
    "BXSR_VEHICLE_3_VIN_COUNTRY_CD",
    "BXSR_VEHICLE_4_VIN_COUNTRY_CD",
    "BXSR_VEHICLE_5_VIN_COUNTRY_CD",
]
df = spark.createDataFrame(data, names_of_cols)
df.show(truncate=False)
# +---+-----------------------------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+
# | id|BXSR_VEHICLE_1_VIN_COUNTRY_CD|BXSR_VEHICLE_2_VIN_COUNTRY_CD|BXSR_VEHICLE_3_VIN_COUNTRY_CD|BXSR_VEHICLE_4_VIN_COUNTRY_CD|BXSR_VEHICLE_5_VIN_COUNTRY_CD|
# +---+-----------------------------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+
# | 1| 10| 20| null| 30| 40|
# | 2| null| 15| 25| 35| null|
# +---+-----------------------------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+
Reproducing what you have
You want to collect the values from the multiple columns into a single array column (an array rather than a struct makes removing the nulls straightforward), for example:
import re
from pyspark.sql.functions import col, array
collect_cols = [c for c in df.columns if re.match('BXSR_VEHICLE_\\d_VIN_COUNTRY_CD', c)]
collect_cols
# ['BXSR_VEHICLE_1_VIN_COUNTRY_CD', 'BXSR_VEHICLE_2_VIN_COUNTRY_CD', 'BXSR_VEHICLE_3_VIN_COUNTRY_CD', 'BXSR_VEHICLE_4_VIN_COUNTRY_CD', 'BXSR_VEHICLE_5_VIN_COUNTRY_CD']
(
    df.
    withColumn(
        "VIN_COUNTRY_CD",
        array(*collect_cols)
    ).
    select('id', 'VIN_COUNTRY_CD').
    show(truncate=False)
)
# +---+-----------------+
# |id |VIN_COUNTRY_CD |
# +---+-----------------+
# |1 |[10, 20,, 30, 40]|
# |2 |[, 15, 25, 35,] |
# +---+-----------------+
Solution
And then remove NULLs from the array
from pyspark.sql.functions import array, struct, lit, array_except
(
    df.
    withColumn(
        "VIN_COUNTRY_CD",
        array(*collect_cols)
    ).
    withColumn(
        'VIN_COUNTRY_CD',
        array_except(
            col('VIN_COUNTRY_CD'),
            array(lit(None).cast('string'))
        )
    ).
    select('id', 'VIN_COUNTRY_CD').
    show(truncate=False)
)
# +---+----------------+
# |id |VIN_COUNTRY_CD |
# +---+----------------+
# |1 |[10, 20, 30, 40]|
# |2 |[15, 25, 35] |
# +---+----------------+
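A variant sketch of my own (assuming Spark 2.4+): the higher-order filter function drops the nulls without the array_except / cast-null trick, reusing df and collect_cols from above:
from pyspark.sql.functions import array, expr

(
    df.
    withColumn("VIN_COUNTRY_CD", array(*collect_cols)).
    # keep only the non-null entries of the array
    withColumn("VIN_COUNTRY_CD", expr("filter(VIN_COUNTRY_CD, x -> x IS NOT NULL)")).
    select('id', 'VIN_COUNTRY_CD').
    show(truncate=False)
)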

Spark struct represented by OneHotEncoder

I have a data frame with two columns,
+---+-------+
| id| fruit|
+---+-------+
| 0| apple|
| 1| banana|
| 2|coconut|
| 1| banana|
| 2|coconut|
+---+-------+
I also have a universal list with all the items:
fruitList: Seq[String] = WrappedArray(apple, coconut, banana)
Now I want to create a new column in the dataframe with an array of 1s and 0s, where 1 means the item is present for that row and 0 means it is not.
Desired Output
+---+-----------+
| id| fruitlist|
+---+-----------+
| 0| [1,0,0] |
| 1| [0,1,0] |
| 2|[0,0,1] |
| 1| [0,1,0] |
| 2|[0,0,1] |
+---+-----------+
This is something I tried,
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
val df = spark.createDataFrame(Seq(
  (0, "apple"),
  (1, "banana"),
  (2, "coconut"),
  (1, "banana"),
  (2, "coconut")
)).toDF("id", "fruit")
df.show
import org.apache.spark.sql.functions._
val fruitList = df.select(collect_set("fruit")).first().getAs[Seq[String]](0)
print(fruitList)
I tried to solve this with OneHotEncoder but the result was something like this after converting to dense vector, which is not what I needed.
+---+-------+----------+-------------+---------+
| id| fruit|fruitIndex| fruitVec| vd|
+---+-------+----------+-------------+---------+
| 0| apple| 2.0| (2,[],[])|[0.0,0.0]|
| 1| banana| 1.0|(2,[1],[1.0])|[0.0,1.0]|
| 2|coconut| 0.0|(2,[0],[1.0])|[1.0,0.0]|
| 1| banana| 1.0|(2,[1],[1.0])|[0.0,1.0]|
| 2|coconut| 0.0|(2,[0],[1.0])|[1.0,0.0]|
+---+-------+----------+-------------+---------+
If you have a collection such as
val fruitList: Seq[String] = Array("apple", "coconut", "banana")
Then you can do it either with inbuilt functions or with a udf function.
inbuilt functions (array, when and lit)
import org.apache.spark.sql.functions._
df.withColumn("fruitList", array(fruitList.map(x => when(lit(x) === col("fruit"),1).otherwise(0)): _*)).show(false)
udf function
import org.apache.spark.sql.functions._
def containedUdf = udf((fruit: String) => fruitList.map(x => if(x == fruit) 1 else 0))
df.withColumn("fruitList", containedUdf(col("fruit"))).show(false)
which should give you
+---+-------+---------+
|id |fruit |fruitList|
+---+-------+---------+
|0 |apple |[1, 0, 0]|
|1 |banana |[0, 0, 1]|
|2 |coconut|[0, 1, 0]|
|1 |banana |[0, 0, 1]|
|2 |coconut|[0, 1, 0]|
+---+-------+---------+
udf functions are easy to understand and straightforward, since they deal with primitive datatypes, but they should be avoided when optimized and fast inbuilt functions are available to do the same task.
I hope the answer is helpful.
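For readers working in PySpark rather than Scala, a rough equivalent of the inbuilt-functions approach (my sketch, not from the original answer):
from pyspark.sql import functions as F

fruit_list = ["apple", "coconut", "banana"]
df_py = spark.createDataFrame(
    [(0, "apple"), (1, "banana"), (2, "coconut"), (1, "banana"), (2, "coconut")],
    ["id", "fruit"])

# one 0/1 flag per fruit in fruit_list, in the list's order
df_py.withColumn(
    "fruitList",
    F.array(*[F.when(F.col("fruit") == F.lit(x), 1).otherwise(0) for x in fruit_list])
).show(truncate=False)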

Sorting DataFrame within rows and getting the ranking

I have the following PySpark DataFrame :
+----+----------+----------+----------+
| id| a| b| c|
+----+----------+----------+----------+
|2346|2017-05-26| null|2016-12-18|
|5678|2013-05-07|2018-05-12| null|
+----+----------+----------+----------+
My ideal output is :
+----+---+---+---+
|id |a |b |c |
+----+---+---+---+
|2346|2 |0 |1 |
|5678|1 |2 |0 |
+----+---+---+---+
I.e. the more recent the date within the row, the higher the score.
I have looked at similar posts suggesting the use of window functions. The problem is that I need to order my values within the row, not within a column.
You can put the values in each row into an array and use pyspark.sql.functions.sort_array() to sort it.
import pyspark.sql.functions as f
cols = ["a", "b", "c"]
df = df.select("*", f.sort_array(f.array([f.col(c) for c in cols])).alias("sorted"))
df.show(truncate=False)
#+----+----------+----------+----------+------------------------------+
#|id |a |b |c |sorted |
#+----+----------+----------+----------+------------------------------+
#|2346|2017-05-26|null |2016-12-18|[null, 2016-12-18, 2017-05-26]|
#|5678|2013-05-07|2018-05-12|null |[null, 2013-05-07, 2018-05-12]|
#+----+----------+----------+----------+------------------------------+
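Incidentally, sort_array with the default ascending order places nulls first, which is what lines the array indices up with the desired ranking; a quick standalone check (my own snippet):
import pyspark.sql.functions as f

spark.range(1).select(
    f.sort_array(
        f.array(f.lit("2017-05-26"), f.lit(None).cast("string"), f.lit("2016-12-18"))
    ).alias("sorted")
).show(truncate=False)
# sorted comes back as [null, 2016-12-18, 2017-05-26]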
Now you can use a combination of pyspark.sql.functions.coalesce() and pyspark.sql.functions.when() to loop over each of the columns in cols and find the corresponding index in the sorted array.
df = df.select(
    "id",
    *[
        f.coalesce(
            *[
                f.when(
                    f.col("sorted").getItem(i) == f.col(c),
                    f.lit(i)
                )
                for i in range(len(cols))
            ]
        ).alias(c)
        for c in cols
    ]
)
df.show(truncate=False)
#+----+---+----+----+
#|id |a |b |c |
#+----+---+----+----+
#|2346|2 |null|1 |
#|5678|1 |2 |null|
#+----+---+----+----+
Finally fill the null values with 0:
df = df.na.fill(0)
df.show(truncate=False)
#+----+---+---+---+
#|id |a |b |c |
#+----+---+---+---+
#|2346|2 |0 |1 |
#|5678|1 |2 |0 |
#+----+---+---+---+
