How to replace the missing values with average of ffill() and bfill() in pandas? - python-3.x

This is a sample dataframe, and it contains NA values:
x y z datetime
0 2 3 4 02-02-2019
1 NA NA NA 03-02-2019
2 3 5 7 04-02-2019
3 NA NA NA 05-02-2019
4 4 7 9 06-02-2019
Now, I want to fill these NA values, and I can do this by using either ffill() or bfill(). But what if I want to apply the average of ffill() and bfill()? How can I do this?
The direct average df = (df.ffill() + df.bfill()) / 2 didn't work because of the datetime column.
The end dataframe should look like this:
x y z datetime
0 2 3 4 02-02-2019
1 2.5 4 5.5 03-02-2019
2 3 5 7 04-02-2019
3 3.5 6 8 05-02-2019
4 4 7 9 06-02-2019

Check with df.interpolate:
df.interpolate()
x y z datetime
0 2.0 3.0 4.0 02-02-2019
1 2.5 4.0 5.5 03-02-2019
2 3.0 5.0 7.0 04-02-2019
3 3.5 6.0 8.0 05-02-2019
4 4.0 7.0 9.0 06-02-2019
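If you want the literal average of ffill() and bfill() rather than interpolation, one option is to restrict the arithmetic to the numeric columns so the datetime column is left alone. This is a minimal sketch that assumes the sample frame from the question, with the NA cells stored as NaN and the datetime column stored as strings:

```python
import numpy as np
import pandas as pd

# Sample frame from the question; NA cells are NaN, datetime is kept as strings.
df = pd.DataFrame({
    "x": [2, np.nan, 3, np.nan, 4],
    "y": [3, np.nan, 5, np.nan, 7],
    "z": [4, np.nan, 7, np.nan, 9],
    "datetime": ["02-02-2019", "03-02-2019", "04-02-2019",
                 "05-02-2019", "06-02-2019"],
})

# Average the forward fill and the backward fill, numeric columns only.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = (df[num_cols].ffill() + df[num_cols].bfill()) / 2

print(df)
```

For isolated single gaps like these, the result matches df.interpolate(). For runs of several consecutive NaNs the two differ: the average gives every row in the gap the same midpoint, while linear interpolation spaces the values evenly. Leading or trailing NaNs stay NaN with the average, because one of the two fills has nothing to copy from.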

Related

pandas - rolling sum last seven days over different rows

Starting from this data frame:
id date value
1 01.01. 2
2 01.01. 3
1 01.03. 5
2 01.03. 3
1 01.09. 5
2 01.09. 2
1 01.10. 5
2 01.10. 2
I would like to get a weekly sum of value:
id date value
1 01.01. 2
2 01.01. 3
1 01.03. 7
2 01.03. 6
1 01.09. 10
2 01.09. 5
1 01.10. 15
2 01.10. 8
I use this command, but it is not working:
df['value'] = df.groupby('id')['value'].rolling(7).sum()
Any ideas?
You can do groupby and apply.
df['date'] = pd.to_datetime(df['date'], format='%m.%d.')
df['value'] = (df.groupby('id', as_index=False, group_keys=False)
                 .apply(lambda g: g.rolling('7D', on='date')['value'].sum()))
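Note that for 1900-01-10 the 7-day rolling window covers 1900-01-04 through 1900-01-10, so the 1900-01-03 row is excluded. (The year is 1900 because the format string contains no year.)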
print(df)
id date value
0 1 1900-01-01 2.0
1 2 1900-01-01 3.0
2 1 1900-01-03 7.0
3 2 1900-01-03 6.0
4 1 1900-01-09 10.0
5 2 1900-01-09 5.0
6 1 1900-01-10 10.0
7 2 1900-01-10 4.0
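An alternative that avoids apply is to put the dates in the index and call rolling directly on the groupby. This is a rough sketch, assuming the sample data from the question; the column name weekly_sum is only illustrative:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "id": [1, 2, 1, 2, 1, 2, 1, 2],
    "date": ["01.01.", "01.01.", "01.03.", "01.03.",
             "01.09.", "01.09.", "01.10.", "01.10."],
    "value": [2, 3, 5, 3, 5, 2, 5, 2],
})
df["date"] = pd.to_datetime(df["date"], format="%m.%d.")

# Time-based rolling window per id; requires a datetime index.
rolled = (
    df.set_index("date")
      .groupby("id")["value"]
      .rolling("7D")
      .sum()
      .reset_index(name="weekly_sum")  # hypothetical column name
)

# Attach the rolling sums back to the original rows.
result = df.merge(rolled, on=["id", "date"], how="left")
print(result)
```

The merge step assumes each (id, date) pair occurs only once; with duplicate pairs the join would multiply rows.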

Split value of a row with equal values based on the unique entry

I have a dataframe which looks like this:
Name T1 T2 alpha
A 10 3 30
A 11 5 Nan
A 13 5 Nan
B 5 2 7
B 3 1 Nan
However, I need to divide the alpha value into equal parts to replace the Nan values for each unique name in the Name column. For example, 30 for A becomes 10 on each row where A is present:
Name T1 T2 alpha
A 10 3 10
A 11 5 10
A 13 5 10
B 5 2 3.5
B 3 1 3.5
I tried using explode, but it does not produce the result I want. Any ideas would help.
With groupby and transform
df['alpha'] = pd.to_numeric(df['alpha'],errors = 'coerce').fillna(0).groupby(df['Name']).transform('mean')
df
Out[50]:
Name T1 T2 alpha
0 A 10 3 10.0
1 A 11 5 10.0
2 A 13 5 10.0
3 B 5 2 3.5
4 B 3 1 3.5
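This'll get the job done for the sample data, but only because T2 in the first row of each group happens to equal the number of rows in that group: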
df['alpha'] = (df['alpha'] / df['T2']).ffill()
Output:
Name T1 T2 alpha
0 A 10 3 10.0
1 A 11 5 10.0
2 A 13 5 10.0
3 B 5 2 3.5
4 B 3 1 3.5
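For a version that does not depend on any other column, you can divide each group's known alpha by the group's row count. This is a self-contained sketch, assuming the missing values are real NaN (not the string 'Nan') and that each Name has exactly one non-missing alpha:

```python
import numpy as np
import pandas as pd

# Sample frame from the question, with missing alpha stored as NaN.
df = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B"],
    "T1": [10, 11, 13, 5, 3],
    "T2": [3, 5, 5, 2, 1],
    "alpha": [30, np.nan, np.nan, 7, np.nan],
})

grouped = df.groupby("Name")["alpha"]
# sum skips NaN, so it returns the single known value per group;
# size counts every row in the group, including the NaN rows.
df["alpha"] = grouped.transform("sum") / grouped.transform("size")

print(df)
```

If a group could contain more than one non-missing alpha, the sum would combine them, so check that assumption against your real data first.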

Groupby count of non NaN of another column and a specific calculation of the same columns in pandas

I have a data frame as shown below
ID Class Score1 Score2 Name
1 A 9 7 Xavi
2 B 7 8 Alba
3 A 10 8 Messi
4 A 8 10 Neymar
5 A 7 8 Mbappe
6 C 4 6 Silva
7 C 3 2 Pique
8 B 5 7 Ramos
9 B 6 7 Serge
10 C 8 5 Ayala
11 A NaN 4 Casilas
12 A NaN 4 De_Gea
13 B NaN 2 Seaman
14 C NaN 7 Chilavert
15 B NaN 3 Courtous
From the above, I would like to calculate the number of players with Score1 less than or equal to 6 in each Class, along with the count of non-NaN rows (per Class).
Expected output:
Class Total_Number Count_Non_NaN Score1_less_than_6_# Avg_score1
A 6 4 0 8.5
B 5 3 2 6
C 4 3 2 5
I tried the code below:
df2 = df.groupby('Class').agg(Total_Number = ('Score1','size'),
                              Score1_less_than_6 = ('Score1',lambda x: x.between(0,6).sum()),
                              Avg_score1 = ('Score1','mean'))
df2 = df2.reset_index()
df2
Groupby and aggregate using a dictionary
df['s'] = df['Score1'].le(6)
df.groupby('Class').agg(**{'total_number': ('Score1', 'size'),
                           'count_non_nan': ('Score1', 'count'),
                           'score1_less_than_six': ('s', 'sum'),
                           'avg_score1': ('Score1', 'mean')})
total_number count_non_nan score1_less_than_six avg_score1
Class
A 6 4 0 8.5
B 5 3 2 6.0
C 4 3 2 5.0
Try:
x = df.groupby("Class", as_index=False).agg(
    Total_Number=("Class", "count"),
    Count_Non_NaN=("Score1", lambda x: x.notna().sum()),
    Score1_less_than_6=("Score1", lambda x: (x <= 6).sum()),
    Avg_score1=("Score1", "mean"),
)
print(x)
Prints:
Class Total_Number Count_Non_NaN Score1_less_than_6 Avg_score1
0 A 6 4.0 0.0 8.5
1 B 5 3.0 2.0 6.0
2 C 4 3.0 2.0 5.0
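To try this without retyping the whole table, here is a self-contained sketch that rebuilds only the Class and Score1 columns from the question and uses the built-in 'size' and 'count' aggregations, so no helper column is needed. The variable name summary is just illustrative:

```python
import numpy as np
import pandas as pd

# Class and Score1 columns from the question, in ID order 1..15.
df = pd.DataFrame({
    "Class": list("ABAAACCBBCAABCB"),
    "Score1": [9, 7, 10, 8, 7, 4, 3, 5, 6, 8,
               np.nan, np.nan, np.nan, np.nan, np.nan],
})

summary = df.groupby("Class").agg(
    Total_Number=("Score1", "size"),    # all rows, NaN included
    Count_Non_NaN=("Score1", "count"),  # non-NaN rows only
    Score1_less_than_6=("Score1", lambda s: int((s <= 6).sum())),
    Avg_score1=("Score1", "mean"),      # mean skips NaN
)

print(summary)
```

Comparisons against NaN evaluate to False, so the NaN rows are not counted in Score1_less_than_6.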

Replace a column value with number using pandas

For the following dataset, I can replace column 1 with the numeric value easily.
df['1'].replace(['A', 'B', 'C', 'D'], [0, 1, 2, 3], inplace=True)
But if a column has 3600 or more distinct values, how can I replace them with numeric values without writing out every value?
I don't understand how to do that. If anybody has a solution, please share it with me.
Thanks in advance.
import pandas as pd
df = pd.DataFrame({1:['A','B','C','C','D','A'],
2:[0.6,0.9,5,4,7,1,],
3:[0.3,1,0.7,8,2,4]})
print(df)
1 2 3
0 A 0.6 0.3
1 B 0.9 1.0
2 C 5.0 0.7
3 C 4.0 8.0
4 D 7.0 2.0
5 A 1.0 4.0
np.where makes it easy.
import numpy as np
df[1] = np.where(df[1]=="A", "0",
        np.where(df[1]=="B", "1",
        np.where(df[1]=="C","2",
        np.where(df[1]=="D","3",np.nan))))
print(df)
1 2 3
0 0 0.6 0.3
1 1 0.9 1.0
2 2 5.0 0.7
3 2 4.0 8.0
4 3 7.0 2.0
5 0 1.0 4.0
But if you have a lot of categories, you might want to think about other ways.
import string
upper=list(string.ascii_uppercase)
a=pd.DataFrame({'Alp':upper})
print(a)
Alp
0 A
1 B
2 C
3 D
4 E
5 F
6 G
7 H
8 I
9 J
.
.
19 T
20 U
21 V
22 W
23 X
24 Y
25 Z
for k in np.arange(0,26):
    a=a.replace(to_replace =upper[k],value =k)
print(a)
Alp
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
.
.
.
21 21
22 22
23 23
24 24
25 25
If there are many values to replace, you can use factorize:
df[1] = pd.factorize(df[1])[0] + 1
print (df)
1 2 3
0 1 0.6 0.3
1 2 0.9 1.0
2 3 5.0 0.7
3 3 4.0 8.0
4 4 7.0 2.0
5 1 1.0 4.0
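If you want to control which number each label gets (for example, codes in sorted order starting at 0), another option is to build a dictionary from the unique values and apply it with map. This is a small sketch, assuming the sample frame from the question; the name mapping is only illustrative:

```python
import pandas as pd

# Sample frame from the question; the column labels are the integers 1, 2, 3.
df = pd.DataFrame({1: ['A', 'B', 'C', 'C', 'D', 'A'],
                   2: [0.6, 0.9, 5, 4, 7, 1],
                   3: [0.3, 1, 0.7, 8, 2, 4]})

# Build label -> code from the data itself, so nothing is typed by hand.
mapping = {label: code for code, label in enumerate(sorted(df[1].unique()))}
df[1] = df[1].map(mapping)

print(mapping)
print(df)
```

Keeping the dictionary around also lets you apply the same codes to another frame later, or invert it to recover the original labels.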
You could do something like
# the sample frame labels the column with the integer 1, not the string '1'
df.loc[df[1] == 'A', 1] = 0
df.loc[df[1] == 'B', 1] = 1
### Or
keys = df[1].unique().tolist()
for i, key in enumerate(keys):
    df.loc[df[1] == key, 1] = i

Multiple columns difference of 2 Pandas DataFrame

I am new to Python and pandas. Can someone help me with the report below?
I want to report the difference of N columns and create new columns with the difference values. Is it possible to make this dynamic, since I have more than 30 columns? (The number of columns is fixed; the row values can change.)
A and B can be alphanumeric.
Use join with sub for difference of DataFrames:
#if columns are strings, first cast it
df1 = df1.astype(int)
df2 = df2.astype(int)
#if first columns are not indices
#df1 = df1.set_index('ID')
#df2 = df2.set_index('ID')
df = df1.join(df2.sub(df1).add_prefix('sum'))
print (df)
A B sumA sumB
ID
0 10 2.0 5 3.0
1 11 3.0 6 5.0
2 12 4.0 7 5.0
Or similar:
df = df1.join(df2.sub(df1), rsuffix='sum')
print (df)
A B Asum Bsum
ID
0 10 2.0 5 3.0
1 11 3.0 6 5.0
2 12 4.0 7 5.0
Detail:
print (df2.sub(df1))
A B
ID
0 5 3.0
1 6 5.0
2 7 5.0
IIUC
df1[['C','D']]=(df2-df1)[['A','B']]
df1
Out[868]:
ID A B C D
0 0 10 2.0 5 3.0
1 1 11 3.0 6 5.0
2 2 12 4.0 7 5.0
df1.assign(B=0)
Out[869]:
ID A B C D
0 0 10 0 5 3.0
1 1 11 0 6 5.0
2 2 12 0 7 5.0
The 'ID' column should really be an index. See the Pandas tutorial on indexing for why this is a good idea.
df1 = df1.set_index('ID')
df2 = df2.set_index('ID')
df = df1.copy()
df[['C', 'D']] = df2 - df1
df['B'] = 0
print(df)
outputs
A B C D
ID
0 10 0 5 3.0
1 11 0 6 5.0
2 12 0 7 5.0
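To make this work for any number of columns without naming them, you can subtract the whole frames and add a suffix to the result. This is a sketch under assumptions: the original input tables are not shown here, so the df2 values below are reconstructed from the printed differences, and the _diff suffix is an arbitrary choice:

```python
import pandas as pd

# df1 as printed in the answers; df2 reconstructed so that df2 - df1
# gives the differences shown above (an assumption, not the asker's data).
df1 = pd.DataFrame({"ID": [0, 1, 2],
                    "A": [10, 11, 12],
                    "B": [2.0, 3.0, 4.0]}).set_index("ID")
df2 = pd.DataFrame({"ID": [0, 1, 2],
                    "A": [15, 17, 19],
                    "B": [5.0, 8.0, 9.0]}).set_index("ID")

# Subtract every shared column at once, aligned on the ID index.
diff = df2.sub(df1).add_suffix("_diff")
report = df1.join(diff)

print(report)
```

Because sub aligns on both index and column labels, the same two lines handle 30 columns as well as 2, provided the columns being subtracted are numeric.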

Resources