How to dynamically change the shift periods if value is NaN in Pandas? - python-3.x

I have the following DF (simplified):
value
1
4
2
NaN
9
8
7
For example, when I apply the function shift(3)+current_row I get:
value result
1 NaN
4 NaN
2 NaN
NaN NaN
9 13
8 10
7 NaN
But what I need is if a value is NaN, try with periods=N-1:
value result
1 1 shift(3)=NAN -> shift(2)=NAN -> shift(1)=NAN -> current_value
4 5 shift(3)=NAN -> shift(2)=NAN -> shift(1)=1 + current_value
2 3 shift(3)=NAN -> shift(2)=1 + current_value
NaN 1 shift(3)=1 + if current_value == NaN then 0 else current_value
9 13 shift(3)=4 + current_value
8 10 shift(3)=2 + current_value
7 16 shift(3)=NaN -> shift(2)=9 + current_value
If it's possible, in the spirit of the pythonic way.
Thanks and regards.

Shift the value column by each period from 3 down to 1, build a DataFrame from the shifted Series (one row per shift), and backfill down the rows so that the first row holds, for every position, the first non-NaN shifted value. Then select that first row with iloc[0] and add it to the value column, using fill_value=0 so that a NaN on either side counts as zero:
s = pd.DataFrame(df['value'].shift(3 - i) for i in range(3)).bfill().iloc[0]
df['result'] = df['value'].add(s, fill_value=0)
value result
0 1.0 1.0
1 4.0 5.0
2 2.0 3.0
3 NaN 1.0
4 9.0 13.0
5 8.0 10.0
6 7.0 16.0
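The same fallback can also be sketched as an explicit loop, which may be easier to read when the number of periods changes. This is only an illustrative variant, not the answer's method: it swaps the backfill for Series.fillna, and the name fallback is made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [1, 4, 2, np.nan, 9, 8, 7]})

# Start with the longest shift and fall back to shorter ones
# wherever the shifted value is still NaN.
fallback = df["value"].shift(3)
for periods in (2, 1):
    fallback = fallback.fillna(df["value"].shift(periods))

# fill_value=0 treats a missing operand as 0, so a NaN current value
# (or a row with no usable shifted value) still yields a number.
df["result"] = df["value"].add(fallback, fill_value=0)
print(df)
```

With the sample data this should give the same result column as the one-liner above.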

Related

Groupby count of non NaN of another column and a specific calculation of the same columns in pandas

I have a data frame as shown below
ID Class Score1 Score2 Name
1 A 9 7 Xavi
2 B 7 8 Alba
3 A 10 8 Messi
4 A 8 10 Neymar
5 A 7 8 Mbappe
6 C 4 6 Silva
7 C 3 2 Pique
8 B 5 7 Ramos
9 B 6 7 Serge
10 C 8 5 Ayala
11 A NaN 4 Casilas
12 A NaN 4 De_Gea
13 B NaN 2 Seaman
14 C NaN 7 Chilavert
15 B NaN 3 Courtous
From the above, I would like to calculate the number of players with Score1 less than or equal to 6 in each Class, along with the count of non-NaN rows (Class-wise).
Expected output:
Class Total_Number Count_Non_NaN Score1_less_than_6_# Avg_score1
A 6 4 0 8.5
B 5 3 2 6
C 4 3 2 5
I tried the code below:
df2 = df.groupby('Class').agg(Total_Number = ('Score1','size'),
Score1_less_than_6 = ('Score1',lambda x: x.between(0,6).sum()),
Avg_score1 = ('Score1','mean'))
df2 = df2.reset_index()
df2
Groupby and aggregate using a dictionary
df['s'] = df['Score1'].le(6)
df.groupby('Class').agg(**{'total_number': ('Score1', 'size'),
'count_non_nan': ('Score1', 'count'),
'score1_less_than_six': ('s', 'sum'),
'avg_score1': ('Score1', 'mean')})
total_number count_non_nan score1_less_than_six avg_score1
Class
A 6 4 0 8.5
B 5 3 2 6.0
C 4 3 2 5.0
Try:
x = df.groupby("Class", as_index=False).agg(
Total_Number=("Class", "count"),
Count_Non_NaN=("Score1", lambda x: x.notna().sum()),
Score1_less_than_6=("Score1", lambda x: (x <= 6).sum()),
Avg_score1=("Score1", "mean"),
)
print(x)
Prints:
Class Total_Number Count_Non_NaN Score1_less_than_6 Avg_score1
0 A 6 4.0 0.0 8.5
1 B 5 3.0 2.0 6.0
2 C 4 3.0 2.0 5.0
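For reference, a self-contained sketch of the named-aggregation approach on the question's data (only the Class and Score1 columns are rebuilt here; the column name Score1_le_6 is chosen for this example). It relies on size counting every row while count and mean skip NaN.

```python
import numpy as np
import pandas as pd

# Class and Score1 for IDs 1..15, in the order shown in the question.
df = pd.DataFrame({
    "Class": list("ABAAACCBBCAABCB"),
    "Score1": [9, 7, 10, 8, 7, 4, 3, 5, 6, 8] + [np.nan] * 5,
})

summary = df.groupby("Class").agg(
    Total_Number=("Score1", "size"),     # all rows, NaN included
    Count_Non_NaN=("Score1", "count"),   # count skips NaN
    Score1_le_6=("Score1", lambda s: s.le(6).sum()),  # NaN <= 6 is False
    Avg_score1=("Score1", "mean"),       # mean skips NaN
)
print(summary)
```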

Join with column having the max sequence number

I have a margin table
item margin
0 a 3
1 b 4
2 c 5
and an item table
item sequence
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 c 1
6 c 2
7 c 3
I want to join the two table so that the margin will only be joined to the product with maximum sequence number, the desired outcome is
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
How to achieve this?
Below is the code for margin and item table
import pandas as pd
df_margin=pd.DataFrame({"item":["a","b","c"],"margin":[3,4,5]})
df_item=pd.DataFrame({"item":["a","a","a","b","b","c","c","c"],"sequence":[1,2,3,1,2,1,2,3]})
One option would be to merge then replace extra values with NaN via Series.where:
new_df = df_item.merge(df_margin)
new_df['margin'] = new_df['margin'].where(
new_df.groupby('item')['sequence'].transform('max').eq(new_df['sequence'])
)
Or with loc:
new_df = df_item.merge(df_margin)
new_df.loc[new_df.groupby('item')['sequence']
.transform('max').ne(new_df['sequence']), 'margin'] = np.nan
Another option would be to assign a temp column to both frames df_item with True where the value is maximal, and df_margin is True everywhere then merge outer and drop the temp column:
new_df = (
df_item.assign(
t=df_item
.groupby('item')['sequence']
.transform('max')
.eq(df_item['sequence'])
).merge(df_margin.assign(t=True), how='outer').drop(columns='t')
)
All of these produce new_df:
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
You could do:
df_item.merge(df_item.groupby('item')['sequence'].max().\
reset_index().merge(df_margin), 'left')
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
Breakdown:
df_new = df_item.groupby('item')['sequence'].max().reset_index().merge(df_margin)
df_item.merge(df_new, 'left')
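If a merge feels heavy for a lookup this small, one possible alternative (not taken from the answers above) is to map each item to its margin and keep the value only on the row with the maximum sequence. The names is_last and margin_by_item are made up for this sketch, and it assumes item is unique in df_margin.

```python
import pandas as pd

df_margin = pd.DataFrame({"item": ["a", "b", "c"], "margin": [3, 4, 5]})
df_item = pd.DataFrame({
    "item": ["a", "a", "a", "b", "b", "c", "c", "c"],
    "sequence": [1, 2, 3, 1, 2, 1, 2, 3],
})

# True on the row that carries the largest sequence within each item.
is_last = df_item["sequence"].eq(
    df_item.groupby("item")["sequence"].transform("max")
)

# Look up the margin per item, then keep it only where is_last is True.
margin_by_item = df_margin.set_index("item")["margin"]
new_df = df_item.assign(margin=df_item["item"].map(margin_by_item).where(is_last))
print(new_df)
```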

Find no of days gap for a specific ID when it has a flag X in other column

I want to calculate the gap in days between consecutive rows where the 'flag' column equals 'X', within the same ID.
The dataframe that I have:
ID Date flag
1 1-1-2020 X
1 10-1-2020 null
1 15-1-2020 X
2 1-2-2020 X
2 10-2-2020 X
2 18-2-2020 X
3 15-2-2020 null
The dataframe I want:
ID Date flag no_of_days
1 1-1-2020 X 14
1 10-1-2020 null null
1 15-1-2020 X null
2 1-2-2020 X 9
2 10-2-2020 X 8
2 18-2-2020 X null
3 15-2-2020 null null
Thanks in advance.
First filter the rows where flag is X with boolean indexing, then take the next date within each ID group with groupby(...).shift(-1), subtract the current date, and finally convert the timedeltas to days with Series.dt.days:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['new'] = df[df['flag'].eq('X')].groupby('ID')['Date'].shift(-1).sub(df['Date']).dt.days
print (df)
ID Date flag new
0 1 2020-01-01 X 14.0
1 1 2020-01-10 NaN NaN
2 1 2020-01-15 X NaN
3 2 2020-02-01 X 9.0
4 2 2020-02-10 X 8.0
5 2 2020-02-18 X NaN
6 3 2020-02-15 NaN NaN
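A self-contained version of the same steps, split into named intermediates, might look like the sketch below. It assumes the dates are day-first strings and that the last date for ID 2 is 18-2-2020, as in the expected output; flagged and next_flagged are names introduced only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 3],
    "Date": ["1-1-2020", "10-1-2020", "15-1-2020",
             "1-2-2020", "10-2-2020", "18-2-2020", "15-2-2020"],
    "flag": ["X", np.nan, "X", "X", "X", "X", np.nan],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")

# Keep only the flagged rows, then look one flagged row ahead per ID.
flagged = df[df["flag"].eq("X")]
next_flagged = flagged.groupby("ID")["Date"].shift(-1)

# Assignment aligns on the index, so unflagged rows are left as NaN.
df["no_of_days"] = (next_flagged - flagged["Date"]).dt.days
print(df)
```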

Pandas: GroupBy Shift And Cumulative Sum

I want to do a groupby, shift and cumsum, which seems like a pretty trivial task, but I am still banging my head over the result I'm getting. Can someone please tell me what I am doing wrong? All the results I found online show the same approach, or a variation of it. Below is my implementation.
temp = pd.DataFrame(data=[['a',1],['a',1],['a',1],['b',1],['b',1],['b',1],['c',1],['c',1]], columns=['ID','X'])
temp['transformed'] = temp.groupby('ID')['X'].cumsum().shift()
print(temp)
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 3.0
4 b 1 1.0
5 b 1 2.0
6 c 1 3.0
7 c 1 1.0
This is wrong; the output I am looking for is below:
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0
Thanks a lot in advance.
You could use transform() to feed the separate groups that are created at each level of groupby into the cumsum() and shift() methods.
temp['transformed'] = \
temp.groupby('ID')['X'].transform(lambda x: x.cumsum().shift())
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0
For more info on transform() please see here:
https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html#Transformation
https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html#transformation
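You need to use apply here: cumsum is called on the groupby object, but the chained shift then runs on the whole resulting column rather than per group. (On pandas 2.0 and later, pass group_keys=False so the result keeps the original index and can be assigned back.)
temp['transformed'] = temp.groupby('ID', group_keys=False)['X'].apply(lambda x : x.cumsum().shift())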
temp
Out[287]:
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0
While working on this problem, I noticed that as the DataFrame size grows, using lambdas with transform starts to get very slow. I found that using the built-in DataFrameGroupBy methods (like cumsum and shift) instead of lambdas is much faster.
So here's my proposed solution, creating a 'temp' column to save the cumsum for each ID and then shifting in a different groupby:
df['temp'] = df.groupby("ID")['X'].cumsum()
df['transformed'] = df.groupby("ID")['temp'].shift()
df = df.drop(columns=["temp"])
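To make the difference concrete, here is a small sketch on the question's data that puts the chained call next to the two-step groupby. The variable names leaky, running and per_group are only illustrative.

```python
import pandas as pd

temp = pd.DataFrame({"ID": list("aaabbbcc"), "X": [1] * 8})

# Chained call from the question: shift() runs on the whole column,
# so the last total of one group leaks into the first row of the next.
leaky = temp.groupby("ID")["X"].cumsum().shift()

# Grouping again before shifting keeps each group's first row at NaN.
running = temp.groupby("ID")["X"].cumsum()
per_group = running.groupby(temp["ID"]).shift()

temp["transformed"] = per_group
print(temp.assign(leaky=leaky))
```

Grouping the intermediate Series by temp["ID"] avoids the temporary column, at the cost of a slightly less obvious one-liner.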

Copy and Paste Values Based on a Condition in Python

I am trying to populate column 'C' with values from column 'A' based on conditions in column 'B'. Example: If column 'B' equals 'nan', then row under column 'C' equals the row in column 'A'. If column 'B' does NOT equal 'nan', then leave column 'C' as is (ie 'nan'). Next, the values in column 'A' to be removed (only the values that were copied from column A to C).
Original Dataset:
index A B C
0 6 nan nan
1 6 nan nan
2 9 3 nan
3 9 3 nan
4 2 8 nan
5 2 8 nan
6 3 4 nan
7 3 nan nan
8 4 nan nan
Output:
index A B C
0 nan nan 6
1 nan nan 6
2 9 3 nan
3 9 3 nan
4 2 8 nan
5 2 8 nan
6 3 4 nan
7 nan nan 3
8 nan nan 4
Below is what I have tried so far, but it's not working.
def impute_unit(cols):
    Legal_Block = cols[0]
    Legal_Lot = cols[1]
    Legal_Unit = cols[2]
    if pd.isnull(Legal_Lot):
        return 3
    else:
        return Legal_Unit
bk_Final_tax['Legal_Unit'] = bk_Final_tax[['Legal_Block', 'Legal_Lot',
    'Legal_Unit']].apply(impute_unit, axis = 1)
Seems like you need
df['C'] = np.where(df.B.isna(), df.A, df.C)
df['A'] = np.where(df.B.isna(), np.nan, df.A)
A different, maybe fancy way to do it would be to swap A and C values only when B is np.nan
m = df.B.isna()
df.loc[m, ['A', 'C']] = df.loc[m, ['C', 'A']].values
In other words, change
bk_Final_tax['Legal_Unit'] = bk_Final_tax[['Legal_Block', 'Legal_Lot',
'Legal_Unit']].apply(impute_unit, axis = 1)
for
bk_Final_tax['Legal_Unit'] = np.where(bk_Final_tax.Legal_Lot.isna(), bk_Final_tax.Legal_Block, bk_Final_tax.Legal_Unit)
bk_Final_tax['Legal_Block'] = np.where(bk_Final_tax.Legal_Lot.isna(), np.nan, bk_Final_tax.Legal_Block)
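As a runnable check on the A/B/C example from the question, the same idea can also be written with Series.mask, which replaces values where the condition is True. This is an alternative spelling rather than the answer's exact code, and the order matters: C must be filled before A is blanked.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [6, 6, 9, 9, 2, 2, 3, 3, 4],
    "B": [np.nan, np.nan, 3, 3, 8, 8, 4, np.nan, np.nan],
    "C": [np.nan] * 9,
})

m = df["B"].isna()                   # rows where B is missing
df["C"] = df["C"].mask(m, df["A"])   # copy A into C on those rows
df["A"] = df["A"].mask(m)            # then blank out the copied A values
print(df)
```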
