PySpark window function - within n months from current row - apache-spark

I want to remove all rows within x months (before and after, based on date) of any row whose target equals 1, while keeping the target = 1 row itself.
E.g. given this PySpark df:
| id | date         | target |
| a  | "2020-01-01" | 0      |
| a  | "2020-02-01" | 0      |
| a  | "2020-03-01" | 0      |
| a  | "2020-04-01" | 1      |
| a  | "2020-05-01" | 0      |
| a  | "2020-06-01" | 0      |
| a  | "2020-07-01" | 0      |
| a  | "2020-08-01" | 0      |
| a  | "2020-09-01" | 0      |
| a  | "2020-10-01" | 1      |
| a  | "2020-11-01" | 0      |
| b  | "2020-01-01" | 0      |
| b  | "2020-02-01" | 0      |
| b  | "2020-03-01" | 0      |
| b  | "2020-05-01" | 1      |
(Notice that the month of April does not exist for id b.)
If using an x value of 2, the resulting df would be:
| id | date         | target |
| a  | "2020-01-01" | 0      |
| a  | "2020-04-01" | 1      |
| a  | "2020-07-01" | 0      |
| a  | "2020-10-01" | 1      |
| b  | "2020-01-01" | 0      |
| b  | "2020-02-01" | 0      |
| b  | "2020-05-01" | 1      |
I am able to remove the xth row before and after the row of interest using the code below, but I want to remove all rows between the current row and x months away, in both directions, based on date.
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, lead

window = 2
windowSpec = Window.partitionBy("id").orderBy("date")
df = df.withColumn("lagvalue", lag("target", window).over(windowSpec))
df = df.withColumn("leadvalue", lead("target", window).over(windowSpec))
# Parentheses around each comparison are required because & binds tighter than ==
df = df.where((col("lagvalue") == 0) & (col("leadvalue") == 0))

In your case, rangeBetween can be very useful. It looks at the actual values of the ordering column and keeps only the rows whose value falls into the range: e.g. rangeBetween(-2, 2) takes all rows from 2 below to 2 above the current row's value. Since rangeBetween does not work with dates (or strings), I translated the dates into numbers using months_between.
from pyspark.sql import functions as F, Window

df = spark.createDataFrame(
    [('a', '2020-01-01', 0),
     ('a', '2020-02-01', 0),
     ('a', '2020-03-01', 0),
     ('a', '2020-04-01', 1),
     ('a', '2020-05-01', 0),
     ('a', '2020-06-01', 0),
     ('a', '2020-07-01', 0),
     ('a', '2020-08-01', 0),
     ('a', '2020-09-01', 0),
     ('a', '2020-10-01', 1),
     ('a', '2020-11-01', 0),
     ('b', '2020-01-01', 0),
     ('b', '2020-02-01', 0),
     ('b', '2020-03-01', 0),
     ('b', '2020-05-01', 1)],
    ['id', 'date', 'target']
)
window = 2
# Order by the number of months since an arbitrary epoch, so that
# rangeBetween(-window, window) covers +/- `window` months around each row.
windowSpec = (
    Window.partitionBy('id')
    .orderBy(F.months_between('date', F.lit('1970-01-01')))
    .rangeBetween(-window, window)
)
# Sum of target within the range, minus the current row's own target:
# a non-zero value means another target = 1 row lies within `window` months.
df = df.withColumn('to_remove', F.sum('target').over(windowSpec) - F.col('target'))
df = df.where(F.col('to_remove') == 0).drop('to_remove')
df.show()
# +---+----------+------+
# | id| date|target|
# +---+----------+------+
# | a|2020-01-01| 0|
# | a|2020-04-01| 1|
# | a|2020-07-01| 0|
# | a|2020-10-01| 1|
# | b|2020-01-01| 0|
# | b|2020-02-01| 0|
# | b|2020-05-01| 1|
# +---+----------+------+
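If the dates were not aligned to the first of each month, the same idea could be applied at day granularity. The sketch below is my own assumption, not part of the original answer: it orders by datediff from an arbitrary epoch and uses a day-based range, applied to the original, unfiltered df.
from pyspark.sql import functions as F, Window

# Hypothetical day-level variant; 60 days is an assumed stand-in for ~2 months.
days = 60
w = (
    Window.partitionBy('id')
    .orderBy(F.datediff('date', F.lit('1970-01-01')))
    .rangeBetween(-days, days)
)
df_days = df.withColumn('to_remove', F.sum('target').over(w) - F.col('target'))
df_days = df_days.where(F.col('to_remove') == 0).drop('to_remove')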

Related

PySpark join on ID then on year and month from 'date' column

I have 2 PySpark dataframes and want to join on "ID", then on the year of the "date1" and "date2" columns, and then on the month of the same columns.
df1:
ID col1 date1
1 1 2018-01-05
1 2 2018-02-05
2 4 2018-04-05
2 1 2018-05-05
3 1 2019-01-05
3 4 2019-02-05
df2:
ID col2 date2
1 1 2018-01-08
1 1 2018-02-08
2 4 2018-04-08
2 3 2018-05-08
3 1 2019-01-08
3 4 2019-02-08
Expected output:
ID col1 date1 col2 date2
1 1 2018-01-05 1 2018-01-08
1 2 2018-02-05 1 2018-02-08
2 4 2018-04-05 4 2018-04-08
2 1 2018-05-05 3 2018-05-08
3 1 2019-01-05 1 2019-01-08
3 4 2019-02-05 4 2019-02-08
I tried something along the lines of:
df = df1.join(df2, (ID & (df1.F.year(date1) == df2.F.year(date2)) & (df1.F.month(date1) == df2.F.month(date2))
How to join on date's month and year?
You can do it like this:
join_on = (df1.ID == df2.ID) & \
    (F.year(df1.date1) == F.year(df2.date2)) & \
    (F.month(df1.date1) == F.month(df2.date2))
df = df1.join(df2, join_on)
Full example:
from pyspark.sql import functions as F

df1 = spark.createDataFrame(
    [(1, 1, '2018-01-05'),
     (1, 2, '2018-02-05'),
     (2, 4, '2018-04-05'),
     (2, 1, '2018-05-05'),
     (3, 1, '2019-01-05'),
     (3, 4, '2019-02-05')],
    ['ID', 'col1', 'date1'])
df2 = spark.createDataFrame(
    [(1, 1, '2018-01-08'),
     (1, 1, '2018-02-08'),
     (2, 4, '2018-04-08'),
     (2, 3, '2018-05-08'),
     (3, 1, '2019-01-08'),
     (3, 4, '2019-02-08')],
    ['ID', 'col2', 'date2'])
join_on = (df1.ID == df2.ID) & \
    (F.year(df1.date1) == F.year(df2.date2)) & \
    (F.month(df1.date1) == F.month(df2.date2))
df = df1.join(df2, join_on).drop(df2.ID)
df.show()
# +---+----+----------+----+----------+
# | ID|col1| date1|col2| date2|
# +---+----+----------+----+----------+
# | 1| 1|2018-01-05| 1|2018-01-08|
# | 1| 2|2018-02-05| 1|2018-02-08|
# | 2| 4|2018-04-05| 4|2018-04-08|
# | 2| 1|2018-05-05| 3|2018-05-08|
# | 3| 1|2019-01-05| 1|2019-01-08|
# | 3| 4|2019-02-05| 4|2019-02-08|
# +---+----+----------+----+----------+
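As a possible simplification (my own sketch, not part of the original answer), truncating both dates to the first day of their month collapses the year and month checks into a single comparison. It reuses df1 and df2 from the example above.
from pyspark.sql import functions as F

# Hypothetical alternative: compare the dates truncated to the month start,
# which is equivalent to checking year and month separately for valid dates.
join_on = (df1.ID == df2.ID) & \
    (F.trunc(df1.date1, 'month') == F.trunc(df2.date2, 'month'))
df = df1.join(df2, join_on).drop(df2.ID)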

How to get duplicated values in a data frame when the column is a list?

Good morning!
I have a data frame with several columns. One of these columns, data, contains lists. Below I show a little example (id is just an example with random information):
df =
id data
0 a [1, 2, 3]
1 h [3, 2, 1]
2 bf [1, 2, 3]
What I want is to get the rows with duplicated values in the data column; in this example I should get rows 0 and 2, because the values in their data column are the same (the list [1, 2, 3]). However, this can't be achieved with df.duplicated(subset=['data']) because list is an unhashable type.
I know it could be done by taking two rows and comparing their data directly, but my real data frame can have 1000 rows or more, so I can't compare them one by one.
I hope someone knows how to do it!
Thank you very much in advance!
IIUC, we can create a new DataFrame from df['data'] and then check with DataFrame.duplicated.
You can use:
m = pd.DataFrame(df['data'].tolist()).duplicated(keep=False)
df.loc[m]
id data
0 a [1, 2, 3]
2 bf [1, 2, 3]
Expanding on Quang's comment:
Try
In [2]: elements = [(1,2,3), (3,2,1), (1,2,3)]
...: df = pd.DataFrame.from_records(elements)
...: df
Out[2]:
0 1 2
0 1 2 3
1 3 2 1
2 1 2 3
In [3]: # Add a new column of tuples
...: df["new"] = df.apply(lambda x: tuple(x), axis=1)
...: df
Out[3]:
0 1 2 new
0 1 2 3 (1, 2, 3)
1 3 2 1 (3, 2, 1)
2 1 2 3 (1, 2, 3)
In [4]: # Remove duplicate rows (Keeping the first one)
...: df.drop_duplicates(subset="new", keep="first", inplace=True)
...: df
Out[4]:
0 1 2 new
0 1 2 3 (1, 2, 3)
1 3 2 1 (3, 2, 1)
In [5]: # Remove the new column if not required
...: df.drop("new", axis=1, inplace=True)
...: df
Out[5]:
0 1 2
0 1 2 3
1 3 2 1
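Applied directly to the original data frame from the question, the same tuple idea becomes a short filter. This is a sketch of mine, assuming df has the id/data columns shown in the question:
import pandas as pd

df = pd.DataFrame({'id': ['a', 'h', 'bf'],
                   'data': [[1, 2, 3], [3, 2, 1], [1, 2, 3]]})
# Converting each list to a hashable tuple lets duplicated() work as usual;
# keep=False marks every member of a duplicate group, so rows 0 and 2 are kept.
mask = df['data'].apply(tuple).duplicated(keep=False)
print(df[mask])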

How to extract max length row with pandas

I would like to extract the row containing the longest string in the DataFrame.
In the following case, I would like to get the row with id 2, because its B column value bbbbbb has the maximum length, 6.
|id|A |B |
|1 |abc |aaa |
|2 |abb |bbbbbb|
|3 |aadd|cccc |
|4 |aadc|ddddd |
Expected output:
|id|A |B |
|2 |abb |bbbbbb|
Please give me some advice. Thanks.
Let's first create the DataFrame with your example:
import pandas as pd

data = {
    "id": {0: 1, 1: 2, 2: 3, 3: 4},
    "A": {0: "abc", 1: "abb", 2: "aadd", 3: "aadc"},
    "B": {0: "aaa", 1: "bbbbbb", 2: "cccc", 3: "ddddd"}
}
df = pd.DataFrame(data)
Then you can find the row where B is longest and retrieve it with:
# Index where B is longest
idx = df["B"].apply(len).idxmax()
# Get that row
df.iloc[idx, :]
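A small usage note of my own, not from the original answer: wrapping the index in a list keeps the result as a one-row DataFrame rather than a Series.
# With the default RangeIndex, idx is both a label and a position,
# so loc with a list selector returns a one-row DataFrame.
df.loc[[idx], :]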
Select all columns of object dtype (i.e. strings) with DataFrame.select_dtypes, compute the string lengths and take the maximum per row, and finally filter with boolean indexing to keep every row that matches the maximal length:
s = df.select_dtypes(object).apply(lambda x: x.str.len()).max(axis=1)
#if no missing values
#s = df.select_dtypes(object).applymap(len).max(axis=1)
df1 = df[s.eq(s.max())]
print (df1)
id A B
1 2 abb bbbbbb
Another idea, for only the first match, uses Series.idxmax and DataFrame.loc; the added [] returns a one-row DataFrame:
df1 = df.loc[[df.select_dtypes(object).apply(lambda x: x.str.len()).max(axis=1).idxmax()]]
#if no missing values
#df1 = df.loc[[df.select_dtypes(object).applymap(len).max(axis=1).idxmax()]]
print (df1)
id A B
1 2 abb bbbbbb
First, you can find the maximal length for each row and then the index of the row with the maximal value:
df.loc[df[['A', 'B']].apply(lambda x: x.str.len().max(), axis=1).idxmax()]

How to print the elements of two lists together

I have two lists with a different number of elements. I would like to print each element of the first list with each element of the second list and so on.
a = [1,2,3,4,5]
b = ["banana", "orange", "pear"]
The output I would like to obtain:
1 banana
1 orange
1 pear
2 banana
2 orange
and so on.
I tried this:
a = [1,2,3,4,5]
b = ["banana", "orange", "pear"]
for i, k in zip(a, b):
    print(i, k)
but I get this output:
1 banana
2 orange
3 pear
Process finished with exit code 0
You are looking for itertools.product:
>>> import itertools as it
>>> a = [1,2,3,4,5]
>>> b = ["banana", "orange", "pear"]
>>> for x in it.product(a, b):
... print(x)
...
(1, 'banana')
(1, 'orange')
(1, 'pear')
(2, 'banana')
(2, 'orange')
(2, 'pear')
(3, 'banana')
(3, 'orange')
(3, 'pear')
(4, 'banana')
(4, 'orange')
(4, 'pear')
(5, 'banana')
(5, 'orange')
(5, 'pear')
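To match the exact "1 banana" formatting from the question, the tuples can be unpacked in the loop. A minimal sketch:
import itertools as it

a = [1, 2, 3, 4, 5]
b = ["banana", "orange", "pear"]
# Unpacking each (number, fruit) pair prints the elements space-separated
# instead of as a tuple.
for num, fruit in it.product(a, b):
    print(num, fruit)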

pandas - restructuring data in a data frame

I have a data frame that has data in format
time | name | value
01/01/1970 | A | 1
02/01/1970 | A | 2
03/01/1970 | A | 1
01/01/1970 | B | 5
02/01/1970 | B | 3
I what to change this data to something like
time | A | B
01/01/1970 | 1 | 5
02/01/1970 | 2 | 3
03/01/1970 | 1 | NA
How can I achieve this in pandas? I have tried groupby on the dataframe and then joining, but it's not coming out right.
Thanks in advance.
Use DataFrame.pivot (doc):
import pandas as pd

df = pd.DataFrame(
    {'name': ['A', 'A', 'A', 'B', 'B'],
     'time': ['01/01/1970', '02/01/1970', '03/01/1970', '01/01/1970', '02/01/1970'],
     'value': [1, 2, 1, 5, 3]})
print(df.pivot(index='time', columns='name', values='value'))
yields
A B
time
01/01/1970 1 5
02/01/1970 2 3
03/01/1970 1 NaN
Note that time is now the index. If you wish to make it a column, call reset_index():
df.pivot(index='time', columns='name', values='value').reset_index()
# name time A B
# 0 01/01/1970 1 5
# 1 02/01/1970 2 3
# 2 03/01/1970 1 NaN
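A related note of my own, not part of the original answer: pivot raises an error if an (index, columns) pair appears more than once; in that case pivot_table with an aggregation function can be used on the same df instead.
# Hypothetical fallback if (time, name) pairs were duplicated:
# pivot_table aggregates the clashing values instead of raising an error.
df.pivot_table(index='time', columns='name', values='value', aggfunc='mean').reset_index()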
Use the .pivot function:
df = pd.DataFrame({'time': [0, 1, 2, 3],
                   'name': ['A', 'A', 'B', 'B'],
                   'value': [10, 20, 30, 40]})
df.pivot(index='time', columns='name', values='value')
