Assume this is my data:
date value
2016-01-01 1
2016-01-02 NULL
2016-01-03 NULL
2016-01-04 2
2016-01-05 3
2016-01-06 NULL
2016-01-07 NULL
2016-01-08 NULL
2016-01-09 1
I am trying to find the start and end dates that surround the NULL-value groups. An example output would be as follows:
start end
2016-01-01 2016-01-04
2016-01-05 2016-01-09
My first attempt at the problem produced the following:
df.filter($"value".isNull)
  .agg(
    to_date(date_add(max("date"), 1)) as "max",
    to_date(date_sub(min("date"), 1)) as "min"
  )
but this only finds the overall min and max, not one pair per NULL block. I thought of using groupBy but don't know how to create a grouping column for each of the null-value blocks.
The tricky part is to get the borders of the groups, so you need several steps:
first, build groups of nulls/not-nulls (using window functions)
then, group by block to get the borders within each block
then, use window functions again to extend the borders to the neighbouring rows
Here is a working example:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import ss.implicits._ // ss is the SparkSession

val df = Seq(
  ("2016-01-01", Some(1)),
  ("2016-01-02", None),
  ("2016-01-03", None),
  ("2016-01-04", Some(2)),
  ("2016-01-05", Some(3)),
  ("2016-01-06", None),
  ("2016-01-07", None),
  ("2016-01-08", None),
  ("2016-01-09", Some(1))
).toDF("date", "value")

df
  // build blocks
  .withColumn("isnull", when($"value".isNull, true).otherwise(false))
  .withColumn("lag_isnull", lag($"isnull", 1).over(Window.orderBy($"date")))
  .withColumn("change", coalesce($"isnull" =!= $"lag_isnull", lit(false)))
  .withColumn("block", sum($"change".cast("int")).over(Window.orderBy($"date")))
  // now calculate min/max within groups
  .groupBy($"block")
  .agg(
    min($"date").as("tmp_min"),
    max($"date").as("tmp_max"),
    (count($"value") === 0).as("null_block")
  )
  // now extend groups to include borders
  .withColumn("min", lag($"tmp_max", 1).over(Window.orderBy($"tmp_min")))
  .withColumn("max", lead($"tmp_min", 1).over(Window.orderBy($"tmp_max")))
  // only select null-groups
  .where($"null_block")
  .select($"min", $"max")
  .orderBy($"min")
  .show()
gives
+----------+----------+
| min| max|
+----------+----------+
|2016-01-01|2016-01-04|
|2016-01-05|2016-01-09|
+----------+----------+
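For readers working in PySpark, the same steps translate fairly directly. Here is a minimal sketch, assuming a SparkSession named spark (names otherwise mirror the Scala version above); it should produce the same two rows as the output shown:

from pyspark.sql import functions as F, Window

df = spark.createDataFrame(
    [("2016-01-01", 1), ("2016-01-02", None), ("2016-01-03", None),
     ("2016-01-04", 2), ("2016-01-05", 3), ("2016-01-06", None),
     ("2016-01-07", None), ("2016-01-08", None), ("2016-01-09", 1)],
    "date string, value int",
)

w = Window.orderBy("date")  # unpartitioned window, fine for a small example

(df
 # build blocks
 .withColumn("isnull", F.col("value").isNull())
 .withColumn("lag_isnull", F.lag("isnull", 1).over(w))
 .withColumn("change", F.coalesce(F.col("isnull") != F.col("lag_isnull"), F.lit(False)))
 .withColumn("block", F.sum(F.col("change").cast("int")).over(w))
 # min/max within blocks
 .groupBy("block")
 .agg(F.min("date").alias("tmp_min"),
      F.max("date").alias("tmp_max"),
      (F.count("value") == 0).alias("null_block"))
 # extend blocks to include the borders
 .withColumn("min", F.lag("tmp_max", 1).over(Window.orderBy("tmp_min")))
 .withColumn("max", F.lead("tmp_min", 1).over(Window.orderBy("tmp_max")))
 # keep only the null blocks
 .where("null_block")
 .select("min", "max")
 .orderBy("min")
 .show())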
I don't have a working solution but I do have a few recommendations.
Look at using a lag; you will also have to change that code a bit to produce a lead column.
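As a rough sketch of creating such columns in PySpark (the window spec is an assumption based on the sample data, and you may need to adjust the offsets or naming to reproduce the table below):

from pyspark.sql import functions as F, Window

w = Window.orderBy("date")  # unpartitioned window; fine for a small example

df_2 = (df
        .withColumn("lag_value", F.lag("value", 1).over(w))     # value from the previous row
        .withColumn("lead_value", F.lead("value", 1).over(w)))  # value from the next row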
Now assume you have your lag and lead columns. Your resulting dataframe will look like this:
date value lag_value lead_value
2016-01-01 1 NULL 1
2016-01-02 NULL NULL 1
2016-01-03 NULL 2 NULL
2016-01-04 2 3 NULL
2016-01-05 3 NULL 2
2016-01-06 NULL NULL 3
2016-01-07 NULL NULL NULL
2016-01-08 NULL 1 NULL
2016-01-09 1 1 NULL
Now what you want to do is just filter by the following conditions:
min date:
df.filter("value IS NOT NULL AND lag_value IS NULL")
max date:
df.filter("value IS NULL AND lead_value IS NOT NULL")
If you want to be a bit more advanced, you can also use a when expression to create a new column which states whether the date is a start or end date for a null group:
date value lag_value lead_value group_date_type
2016-01-01 1 NULL 1 start
2016-01-02 NULL NULL 1 NULL
2016-01-03 NULL 2 NULL NULL
2016-01-04 2 3 NULL end
2016-01-05 3 NULL 2 start
2016-01-06 NULL NULL 3 NULL
2016-01-07 NULL NULL NULL NULL
2016-01-08 NULL 1 NULL NULL
2016-01-09 1 1 NULL end
This can be created with something that looks like this:
from pyspark.sql import functions as F

df_2.withColumn(
    'group_date_type',
    F.when(F.expr("value IS NOT NULL AND lag_value IS NULL"), 'start')
     .when(F.expr("value IS NULL AND lead_value IS NOT NULL"), 'end')
     .otherwise(None)
)
I have a dataframe like the below:
Emp_code Leave_applied Leave_approved
0 15-Jan-2021 15-Jan-2021
2 18-Jan-2021 15-Jan-2021
3 20-Jan-2021 np.nan
4 15-Jan-2021 18-Jan-2021
I need to add a new column, leave_type, based on the conditions below:
if Leave_applied is greater than Leave_approved, leave_type = unplanned
if Leave_applied is less than Leave_approved, leave_type = planned
if Leave_applied == Leave_approved, leave_type = planned
if Leave_approved == np.nan, leave_type = missing data
Required output
Emp_code Leave_applied Leave_approved Leave type
0 15-Jan-2021 15-Jan-2021 planned
2 18-Jan-2021 15-Jan-2021 unplanned
3 20-Jan-2021 np.nan missing data
4 15-Jan-2021 18-Jan-2021 planned
I tried doing:
df['leave_type'] = np.where(df['Leave_applied'] > df['Leave_approved'], 'unplanned',
                            np.where(df['Leave_approved'] == np.nan, 'Missing_data', 'Planned'))
The code runs, but none of the rows in my dataframe end up labelled as missing data.
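(A quick aside on why the == np.nan branch can never match: equality comparisons against missing values always evaluate to False, so missing data has to be detected with isna instead. A small illustration:)

import numpy as np
import pandas as pd

print(np.nan == np.nan)                      # False
print(pd.NaT == pd.Timestamp('2021-01-15'))  # False
print(pd.isna(np.nan), pd.isna(pd.NaT))      # True True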
You can try np.select. The idea is that comparing NaT with any date is False, so leave the missing-data case as the default:
df['Leave_applied'] = pd.to_datetime(df['Leave_applied'], errors='coerce')
df['Leave_approved'] = pd.to_datetime(df['Leave_approved'], errors='coerce')
df['Leave type'] = np.select(
    [df['Leave_applied'] > df['Leave_approved'],
     df['Leave_applied'] <= df['Leave_approved']],
    ['unplanned',
     'planned'],
    default='missing data'
)
print(df)
Emp_code Leave_applied Leave_approved Leave type
0 0 2021-01-15 2021-01-15 planned
1 2 2021-01-18 2021-01-15 unplanned
2 3 2021-01-20 NaT missing data
3 4 2021-01-15 2021-01-18 planned
First convert the values to datetimes with to_datetime, and to test for missing values use Series.isna:
df['Leave_applied'] = pd.to_datetime(df['Leave_applied'])
df['Leave_approved'] = pd.to_datetime(df['Leave_approved'])
df['leave_type'] = np.where(df['Leave_applied'] > df['Leave_approved'], 'unplanned',
                            np.where(df['Leave_approved'].isna(), 'Missing_data', 'Planned'))
print (df)
Emp_code Leave_applied Leave_approved leave_type
0 0 2021-01-15 2021-01-15 Planned
1 2 2021-01-18 2021-01-15 unplanned
2 3 2021-01-20 NaT Missing_data
3 4 2021-01-15 2021-01-18 Planned
Or use numpy.select:
df['leave_type'] = np.select([df['Leave_approved'].isna(),
df['Leave_applied'] > df['Leave_approved']],
['Missing_data', 'unplanned'], 'Planned')
Here's my dataset
Id Column_A Column_B Column_C
1 Null 7 Null
2 8 7 Null
3 Null 8 7
4 8 Null 8
Here's my expected output
Column_A Column_B Column_C Total
Null 2 1 2 5
Notnull 2 3 2 7
Assuming Null is NaN, here's one option: use isna + sum to count the NaNs, take the difference between the df length and the number of NaNs for the Notnulls, and then construct a DataFrame from the two.
nulls = df.drop(columns='Id').isna().sum()
notnulls = nulls.rsub(len(df))
out = pd.DataFrame.from_dict({'Null':nulls, 'Notnull':notnulls}, orient='index')
out['Total'] = out.sum(axis=1)
If you're into one liners, we could also do:
out = (df.drop(columns='Id').isna().sum().to_frame(name='Nulls')
         .assign(Notnull=df.drop(columns='Id').notna().sum()).T
         .assign(Total=lambda x: x.sum(axis=1)))
Output:
Column_A Column_B Column_C Total
Nulls 2 1 2 5
Notnull 2 3 2 7
Use Series.value_counts on the mask of non-missing values:
df = (df.replace('Null', np.nan)
        .set_index('Id')
        .notna()
        .apply(pd.value_counts)
        .rename({True: 'Notnull', False: 'Null'}))
df['Total'] = df.sum(axis=1)
print (df)
Column_A Column_B Column_C Total
Null 2 1 2 5
Notnull 2 3 2 7
Here's my dataset
Id Column_A Column_B Column_C
1 Null 7 Null
2 8 7 Null
3 Null 8 7
4 8 Null 8
I want to add a Combination column: if at least one column is Null, the Combination should be Null.
Id Column_A Column_B Column_C Combination
1 Null 7 Null Null
2 8 7 Null Null
3 Null 8 7 Null
4 8 Null 8 Null
Assuming Null is NaN, we could use isna + any:
df['Combination'] = df.isna().any(axis=1).map({True: 'Null', False: 'Notnull'})
If Null is a string, we could use eq + any:
df['Combination'] = df.eq('Null').any(axis=1).map({True: 'Null', False: 'Notnull'})
Output:
Id Column_A Column_B Column_C Combination
0 1 Null 7 Null Null
1 2 8 7 Null Null
2 3 Null 8 7 Null
3 4 8 Null 8 Null
Use DataFrame.isna with DataFrame.any and pass the mask to numpy.where:
df['Combination'] = np.where(df.isna().any(axis=1), 'Null', 'Notnull')
Or, if Null is a literal string, compare with DataFrame.eq instead:
df['Combination'] = np.where(df.eq('Null').any(axis=1), 'Null', 'Notnull')
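A minimal runnable check of the string variant, assuming the Null entries are literal strings as in the sample data:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'Column_A': ['Null', '8', 'Null', '8'],
    'Column_B': ['7', '7', '8', 'Null'],
    'Column_C': ['Null', 'Null', '7', '8'],
})

df['Combination'] = np.where(df.eq('Null').any(axis=1), 'Null', 'Notnull')
print(df)  # every row here contains at least one 'Null', so Combination is 'Null' everywhere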
I have a table with two different IDs and a timestamp, which I would like to rank. The peculiarity is that I want to rank the rows per S_ID only until there is an entry in O_ID. Once there is an entry in O_ID, I want the next rank for that S_ID to start again at 1.
Here is an example:
select
    S_ID,
    timestamp,
    O_ID,
    rank() OVER (PARTITION BY S_ID ORDER BY timestamp asc) AS RANK
from table
order by S_ID, timestamp;
S_ID      Timestamp                O_ID     Rank
2e114e9f  2021-11-26 08:57:44.049  NULL     1
2e114e9f  2021-12-26 17:07:26.272  NULL     2
2e114e9f  2021-12-27 08:13:24.277  NULL     3
2e114e9f  2021-12-29 11:32:56.952  2287549  4
2e114e9f  2021-12-30 13:41:28.821  NULL     5
2e114e9f  2021-12-30 19:53:28.590  NULL     6
2e114e9f  2022-02-05 09:50:54.104  2333002  7
2e114e9f  2022-02-19 10:14:31.389  NULL     8
How can I now add another rank that depends on whether there is an entry in the O_ID column?
So the outcome should be:
S_ID      Timestamp                O_ID     Rank S_ID  Rank both
2e114e9f  2021-11-26 08:57:44.049  NULL     1          1
2e114e9f  2021-12-26 17:07:26.272  NULL     2          2
2e114e9f  2021-12-27 08:13:24.277  NULL     3          3
2e114e9f  2021-12-29 11:32:56.952  2287549  4          4
2e114e9f  2021-12-30 13:41:28.821  NULL     5          1
2e114e9f  2021-12-30 19:53:28.590  NULL     6          2
2e114e9f  2022-02-05 09:50:54.104  2333002  7          3
2e114e9f  2022-02-19 10:14:31.389  NULL     8          1
I am happy about any food for thought!
Looks like the gaps-and-islands approach can be helpful here: use lag to split the data into groups (based on equality of the current and previous O_ID, with some null handling) and then use the group value as an additional partition for the rank() function.
-- sample data
WITH dataset (S_ID, Timestamp, O_ID) AS (
    VALUES ('2e114e9f', timestamp '2021-11-26 08:57:44.049', NULL),
        ('2e114e9f', timestamp '2021-12-26 17:07:26.272', NULL),
        ('2e114e9f', timestamp '2021-12-27 08:13:24.277', NULL),
        ('2e114e9f', timestamp '2021-12-29 11:32:56.952', 2287549),
        ('2e114e9f', timestamp '2021-12-30 13:41:28.821', NULL),
        ('2e114e9f', timestamp '2021-12-30 19:53:28.590', NULL),
        ('2e114e9f', timestamp '2022-02-05 09:50:54.104', 2333002),
        ('2e114e9f', timestamp '2022-02-19 10:14:31.389', NULL)
)
-- query
select S_ID,
    Timestamp,
    O_ID,
    rank() OVER (PARTITION BY S_ID, grp ORDER BY timestamp asc) AS RANK
from (
    select *,
        sum(if(prev is not null and (O_ID is null or O_ID != prev), 1, 0))
            OVER (PARTITION BY S_ID ORDER BY timestamp asc) as grp
    from (
        select *,
            lag(O_ID) OVER (PARTITION BY S_ID ORDER BY timestamp asc) AS prev
        from dataset
    )
)
Output:
S_ID      Timestamp                O_ID     RANK
2e114e9f  2021-11-26 08:57:44.049  NULL     1
2e114e9f  2021-12-26 17:07:26.272  NULL     2
2e114e9f  2021-12-27 08:13:24.277  NULL     3
2e114e9f  2021-12-29 11:32:56.952  2287549  4
2e114e9f  2021-12-30 13:41:28.821  NULL     1
2e114e9f  2021-12-30 19:53:28.590  NULL     2
2e114e9f  2022-02-05 09:50:54.104  2333002  3
2e114e9f  2022-02-19 10:14:31.389  NULL     1
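For comparison, the same gaps-and-islands idea as a minimal PySpark sketch (assuming a DataFrame df with the S_ID, Timestamp and O_ID columns; the variable names are illustrative, not from the original answer):

from pyspark.sql import functions as F, Window

w = Window.partitionBy("S_ID").orderBy("Timestamp")

ranked = (df
          # previous row's O_ID within each S_ID
          .withColumn("prev", F.lag("O_ID").over(w))
          # start a new group whenever the previous row had a non-null O_ID
          .withColumn("grp", F.sum(
              F.when(F.col("prev").isNotNull()
                     & (F.col("O_ID").isNull() | (F.col("O_ID") != F.col("prev"))), 1)
               .otherwise(0)).over(w))
          # rank restarts inside every (S_ID, grp) block
          .withColumn("rank_both",
                      F.rank().over(Window.partitionBy("S_ID", "grp").orderBy("Timestamp"))))

The grp column plays the same role as grp in the SQL query above, so rank_both should restart at 1 on the row following each non-null O_ID.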
I have a situation where I want to create rank columns in a dataframe based on different conditions and set first rank as true and others as false. Below is a sample dataframe:
Column1 Column2 Column3 Column4
ABC X1 null 2016-08-21 11:31:08
ABC X1 Test 2016-08-22 11:31:08
ABC X1 null 2016-08-20 11:31:08
PQR X1 Test 2016-08-23 11:31:08
PQR X1 Test 2016-08-24 11:31:08
PQR X1 null 2016-08-24 11:31:08
Here I want to create Rank columns based on below conditions:
Rank1: Calculate rank on Column1 for rows where Column2 is X1 and Column3 is null and order by Column4
Rank2: Calculate rank on Column1 for rows where Column2 is X1 and Column3 is Test and order by Column4
So the expected outcome would be:
Column1 Column2 Column3 Column4 Rank1 Rank2
ABC X1 null 2016-08-21 11:31:08 2 null
ABC X1 Test 2016-08-22 11:31:08 null 1
ABC X1 null 2016-08-20 11:31:08 1 null
PQR X1 Test 2016-08-23 11:31:08 null 1
PQR X1 Test 2016-08-24 11:31:08 null 2
PQR X1 null 2016-08-24 11:31:08 1 null
I tried to do this using when to filter out the data, but then the ranks were not starting from 1.
df = df.withColumn("Rank1", F.when((df.Column2 == 'X1') & (df.Column3.isNull()), rank().over(Window.partitionBy('Column1').orderBy('Column4'))))
This does give me a sequential order, but the sequence for the matching rows does not necessarily start at 1. I need to label the first rank, so it is important for me to identify it.
The other option I tried was to filter the data into a temporary dataframe, calculate the rank there, and join it back to the main dataframe. But the dataframe is big and multiple rank columns have to be calculated, so it runs out of memory. Any help on solving this problem would be really appreciated.
You need to add the condition to the orderBy clause of the partitionBy window. Ordering by the condition first (descending) pushes the matching rows to the top of each partition, so their ranks start at 1.
This should work for you:
from pyspark.sql import Window
from pyspark.sql.functions import col, lit, rank, when

condition_rank1 = (col("Column2") == 'X1') & (col("Column3").isNull())
condition_rank2 = (col("Column2") == 'X1') & (col("Column3") == 'Test')

# rows matching the condition sort first, then by Column4
w_rank1 = Window.partitionBy('Column1').orderBy(when(condition_rank1, lit(1)).desc(), col("Column4"))
w_rank2 = Window.partitionBy('Column1').orderBy(when(condition_rank2, lit(1)).desc(), col("Column4"))

df.withColumn("Rank1", when(condition_rank1, rank().over(w_rank1))) \
  .withColumn("Rank2", when(condition_rank2, rank().over(w_rank2))) \
  .show()