Python version: 3.6
Pandas version: 0.21.1
How do I get from
print(df_raw)
device_id temp_a temp_b temp_c
0 0 0.2 0.8 0.6
1 0 0.1 0.9 0.4
2 1 0.3 0.7 0.2
3 2 0.5 0.5 0.1
4 2 0.1 0.9 0.4
5 2 0.7 0.3 0.9
to
print(df_except2)
device_id temp_a temp_b temp_c temp_a_1 temp_b_1 temp_c_1 temp_a_2 \
0 0 0.2 0.8 0.6 0.1 0.9 0.4 NaN
1 1 0.3 0.7 0.2 NaN NaN NaN NaN
2 2 0.5 0.5 0.1 0.1 0.9 0.4 0.7
temp_b_2 temp_c_2
0 NaN NaN
1 NaN NaN
2 0.3 0.9
Code to create the data:
df_raw = pd.DataFrame({'device_id' : ['0','0','1','2','2','2'],
'temp_a' : [0.2,0.1,0.3,0.5,0.1,0.7],
'temp_b' : [0.8,0.9,0.7,0.5,0.9,0.3],
'temp_c' : [0.6,0.4,0.2,0.1,0.4,0.9],
})
print(df_raw)
df_except = pd.DataFrame({'device_id' : ['0','1','2'],
'temp_a':[0.2,0.3,0.5],
'temp_b':[0.8,0.7,0.5],
'temp_c':[0.6,0.2,0.1],
'temp_a_1':[0.1,None,0.1],
'temp_b_1':[0.9,None,0.9],
'temp_c_1':[0.4,None,0.4],
'temp_a_2':[None,None,0.7],
'temp_b_2':[None,None,0.3],
'temp_c_2':[None,None,0.9],
})
df_except2 = df_except[['device_id','temp_a','temp_b','temp_c','temp_a_1','temp_b_1','temp_c_1','temp_a_2','temp_b_2','temp_c_2']]
print(df_except2)
Notes:
1. The number of rows per device is unknown.
2. I referred to the following answer:
Pandas Dataframe - How to combine multiple rows to one
But that answer can only handle a single column.
Use:
g = df_raw.groupby('device_id').cumcount()
df = df_raw.set_index(['device_id', g]).unstack().sort_index(axis=1, level=1)
df.columns = ['{}_{}'.format(i,j) if j != 0 else '{}'.format(i) for i, j in df.columns]
df = df.reset_index()
print (df)
device_id temp_a temp_b temp_c temp_a_1 temp_b_1 temp_c_1 temp_a_2 \
0 0 0.2 0.8 0.6 0.1 0.9 0.4 NaN
1 1 0.3 0.7 0.2 NaN NaN NaN NaN
2 2 0.5 0.5 0.1 0.1 0.9 0.4 0.7
temp_b_2 temp_c_2
0 NaN NaN
1 NaN NaN
2 0.3 0.9
Explanation:
First count the occurrences within each device_id group with cumcount.
Create a MultiIndex with set_index and the Series g.
Reshape with unstack.
Sort the second level of the column MultiIndex with sort_index.
Rename the columns with a list comprehension.
Finally, reset_index to turn the index back into a column.
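A step-by-step sketch of those intermediates (assuming the df_raw defined in the question):
g = df_raw.groupby('device_id').cumcount()
# g is the per-device occurrence counter: 0, 1, 0, 0, 1, 2

wide = df_raw.set_index(['device_id', g]).unstack()
# the columns are now a MultiIndex, e.g. ('temp_a', 0), ('temp_a', 1), ('temp_a', 2), ...

wide = wide.sort_index(axis=1, level=1)
# all counter-0 columns come first, then counter-1 columns, then counter-2 columns

wide.columns = ['{}_{}'.format(i, j) if j != 0 else '{}'.format(i) for i, j in wide.columns]
# ('temp_a', 0) -> 'temp_a', ('temp_a', 1) -> 'temp_a_1', ...

out = wide.reset_index()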
Code:
import numpy as np
device_id_list = df_raw['device_id'].tolist()
device_id_list = list(np.unique(device_id_list))
append_df = pd.DataFrame()
for device_id in device_id_list:
    tmp_df = df_raw.query('device_id=="%s"' % (device_id))
    if len(tmp_df) > 1:
        one_raw_list = []
        for i in range(0, len(tmp_df)):
            one_raw_df = tmp_df.iloc[i:i+1]
            one_raw_list.append(one_raw_df)
        tmp_combine_df = pd.DataFrame()
        for i in range(0, len(one_raw_list) - 1):
            next_raw = one_raw_list[i+1].drop(columns=['device_id']).reset_index(drop=True)
            new_name_list = []
            for old_name in list(next_raw.columns):
                new_name_list.append(old_name + '_' + str(i+1))
            next_raw.columns = new_name_list
            if i == 0:
                current_raw = one_raw_list[i].reset_index(drop=True)
                tmp_combine_df = pd.concat([current_raw, next_raw], axis=1)
            else:
                tmp_combine_df = pd.concat([tmp_combine_df, next_raw], axis=1)
        tmp_df = tmp_combine_df
    tmp_df_columns = tmp_df.columns
    append_df_columns = append_df.columns
    append_df = pd.concat([append_df, tmp_df], ignore_index=True)
    if len(tmp_df_columns) > len(append_df_columns):
        append_df = append_df[tmp_df_columns]
    else:
        append_df = append_df[append_df_columns]
print(append_df)
Output:
device_id temp_a temp_b temp_c temp_a_1 temp_b_1 temp_c_1 temp_a_2 \
0 0 0.2 0.8 0.6 0.1 0.9 0.4 NaN
1 1 0.3 0.7 0.2 NaN NaN NaN NaN
2 2 0.5 0.5 0.1 0.1 0.9 0.4 0.7
temp_b_2 temp_c_2
0 NaN NaN
1 NaN NaN
2 0.3 0.9
df = pd.DataFrame({'device_id' : ['0','0','1','2','2','2'],
'temp_a' : [0.2,0.1,0.3,0.5,0.1,0.7],
'temp_b' : [0.8,0.9,0.7,0.5,0.9,0.3],
'temp_c' : [0.6,0.4,0.2,0.1,0.4,0.9],
})
cols_of_interest = df.columns.drop('device_id')
df["C"] = "C_" + (df.groupby("device_id").cumcount() + 1).astype(str)
df.pivot_table(index="device_id", values=cols_of_interest, columns="C")
Output:
temp_a temp_b temp_c
C C_1 C_2 C_3 C_1 C_2 C_3 C_1 C_2 C_3
device_id
0 0.2 0.1 NaN 0.8 0.9 NaN 0.6 0.4 NaN
1 0.3 NaN NaN 0.7 NaN NaN 0.2 NaN NaN
2 0.5 0.1 0.7 0.5 0.9 0.3 0.1 0.4 0.9
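If you want the flat, single-level column names from the question (temp_a_1, temp_b_1, ...), a sketch that flattens the MultiIndex columns (assuming the df and cols_of_interest defined above; note the first measurement is named temp_a_1 here rather than temp_a):
out = df.pivot_table(index="device_id", values=cols_of_interest, columns="C")
# ('temp_a', 'C_1') -> 'temp_a_1', ('temp_b', 'C_2') -> 'temp_b_2', ...
out.columns = ['{}_{}'.format(col, suffix.split('_')[1]) for col, suffix in out.columns]
out = out.reset_index()
print(out)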
I'm trying to split a dataframe wherever all-NaN rows are found, using grps = dfs.isnull().all(axis=1).cumsum().
But this is not working when some rows have a NaN entry in only a single column.
import pandas as pd
from pprint import pprint
import numpy as np
d = {
't': [0, 1, 2, 0, 2, 0, 1],
'input': [2, 2, 2, 2, 2, 2, 4],
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
'value': [0.1, 0.2, 0.3, np.nan, 2, 3, 1],
}
df = pd.DataFrame(d)
dup = df['t'].diff().lt(0).cumsum()
dfs = (
df.groupby(dup, as_index=False, group_keys=False)
.apply(lambda x: pd.concat([x, pd.Series(index=x.columns, name='').to_frame().T]))
)
pprint(dfs)
grps = dfs.isnull().all(axis=1).cumsum()
temp = [dfs.dropna() for _, dfs in dfs.groupby(grps)]
i = 0
dfm = pd.DataFrame()
for df in temp:
    df["name"] = f'name{i}'
    i = i + 1
    df = df.append(pd.Series(dtype='object'), ignore_index=True)
    dfm = dfm.append(df, ignore_index=True)
print(dfm)
Input df:
t input type value
0 0.0 2.0 A 0.1
1 1.0 2.0 A 0.2
2 2.0 2.0 A 0.3
NaN NaN NaN NaN
3 0.0 2.0 B NaN
4 2.0 2.0 B 2.0
NaN NaN NaN NaN
5 0.0 2.0 B 3.0
6 1.0 4.0 A 1.0
Output obtained:
t input type value name
0 0.0 2.0 A 0.1 name0
1 1.0 2.0 A 0.2 name0
2 2.0 2.0 A 0.3 name0
3 NaN NaN NaN NaN NaN
4 2.0 2.0 B 2.0 name1
5 NaN NaN NaN NaN NaN
6 0.0 2.0 B 3.0 name2
7 1.0 4.0 A 1.0 name2
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN
Expected:
t input type value name
0 0.0 2.0 A 0.1 name0
1 1.0 2.0 A 0.2 name0
2 2.0 2.0 A 0.3 name0
3 NaN NaN NaN NaN NaN
4 0.0 2.0 B NaN name1
5 2.0 2.0 B 2.0 name1
6 NaN NaN NaN NaN NaN
7 0.0 2.0 B 3.0 name2
8 1.0 4.0 A 1.0 name2
9 NaN NaN NaN NaN NaN
I am basically doing this to append names to the last column of the dataframe after splitting df
using
dfs = (
df.groupby(dup, as_index=False, group_keys=False)
.apply(lambda x: pd.concat([x, pd.Series(index=x.columns, name='').to_frame().T]))
)
and appending NaN rows.
Again, I use the NaN rows to split the df into a list and add a new column. But dfs.isnull().all(axis=1).cumsum() isn't working for me, and I also get an additional NaN row at the end of the output obtained.
Suggestions on how to get the expected output will be really helpful.
Setup
df = pd.DataFrame(d)
print(df)
t input type value
0 0 2 A 0.1
1 1 2 A 0.2
2 2 2 A 0.3
3 0 2 B NaN
4 2 2 B 2.0
5 0 2 B 3.0
6 1 4 A 1.0
Simplify your approach
# assign name column before splitting
m = df['t'].diff().lt(0)
df['name'] = 'name' + m.cumsum().astype(str)
# Create null dataframes to concat
nan_rows = pd.DataFrame(index=m[m].index)
last_nan_row = pd.DataFrame(index=df.index[[-1]])
# Concat and sort index
df_out = pd.concat([nan_rows, df, last_nan_row]).sort_index(ignore_index=True)
Result
t input type value name
0 0.0 2.0 A 0.1 name0
1 1.0 2.0 A 0.2 name0
2 2.0 2.0 A 0.3 name0
3 NaN NaN NaN NaN NaN
4 0.0 2.0 B NaN name1
5 2.0 2.0 B 2.0 name1
6 NaN NaN NaN NaN NaN
7 0.0 2.0 B 3.0 name2
8 1.0 4.0 A 1.0 name2
9 NaN NaN NaN NaN NaN
Alternatively, if you still want to start from the dfs you already built, here is another approach:
dfs = dfs.reset_index(drop=True)
m = dfs.isna().all(1)
dfs.loc[~m, 'name'] = 'name' + m.cumsum().astype(str)
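A toy illustration of why the mask-plus-cumsum labelling works (a sketch with made-up data, not the dfs from the question): every all-NaN separator increments the counter, so each data row after the k-th separator is labelled name k.
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 3.0, np.nan, 4.0])
m = s.isna()                              # marks the separator positions
labels = 'name' + m.cumsum().astype(str)  # cumsum: 0, 0, 1, 1, 2, 2
print(labels[~m].tolist())                # ['name0', 'name0', 'name1', 'name2']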
Given the following dataframe df:
date mom_pct
0 2020-1-31 1.0
1 2020-2-29 0.8
2 2020-3-31 -1.2
3 2020-4-30 -0.9
4 2020-5-31 -0.8
5 2020-6-30 -0.1
6 2020-7-31 0.6
7 2020-8-31 0.4
8 2020-9-30 0.2
9 2020-10-31 -0.3
10 2020-11-30 -0.6
11 2020-12-31 0.7
12 2021-1-31 1.0
13 2021-2-28 0.6
14 2021-3-31 -0.5
15 2021-4-30 -0.3
16 2021-5-31 -0.2
17 2021-6-30 -0.4
18 2021-7-31 0.3
19 2021-8-31 0.1
20 2021-9-30 0.0
21 2021-10-31 0.7
22 2021-11-30 0.4
23 2021-12-31 -0.3
24 2022-1-31 0.4
25 2022-2-28 0.6
26 2022-3-31 0.0
27 2022-4-30 0.4
28 2022-5-31 -0.2
I want to compare the month-on-month percentage value for a month of the current year to the value for the same month of the previous year. Assume that the value of the same period last year is y_t-1, and the current value of this year is y_t. I will create a new column according to the following rules:
If y_t = y_t-1, return 0 in the new column;
If y_t ∈ (y_t-1, y_t-1 + 0.3], return 1;
If y_t ∈ (y_t-1 + 0.3, y_t-1 + 0.5], return 2;
If y_t > y_t-1 + 0.5, return 3;
If y_t ∈ [y_t-1 - 0.3, y_t-1), return -1;
If y_t ∈ [y_t-1 - 0.5, y_t-1 - 0.3), return -2;
If y_t < y_t-1 - 0.5, return -3.
The expected result:
date mom_pct categorial_mom_pct
0 2020-1-31 1.0 NaN
1 2020-2-29 0.8 NaN
2 2020-3-31 -1.2 NaN
3 2020-4-30 -0.9 NaN
4 2020-5-31 -0.8 NaN
5 2020-6-30 -0.1 NaN
6 2020-7-31 0.6 NaN
7 2020-8-31 0.4 NaN
8 2020-9-30 0.2 NaN
9 2020-10-31 -0.3 NaN
10 2020-11-30 -0.6 NaN
11 2020-12-31 0.7 NaN
12 2021-1-31 1.0 0.0
13 2021-2-28 0.6 -1.0
14 2021-3-31 -0.5 3.0
15 2021-4-30 -0.3 3.0
16 2021-5-31 -0.2 3.0
17 2021-6-30 -0.4 -1.0
18 2021-7-31 0.3 -1.0
19 2021-8-31 0.1 -1.0
20 2021-9-30 0.0 -1.0
21 2021-10-31 0.7 3.0
22 2021-11-30 0.4 3.0
23 2021-12-31 -0.3 -3.0
24 2022-1-31 0.4 -3.0
25 2022-2-28 0.6 0.0
26 2022-3-31 0.0 2.0
27 2022-4-30 0.4 3.0
28 2022-5-31 -0.2 0.0
I attempted to create multiple columns holding the range boundaries and then check which range mom_pct falls into. Is it possible to do this in a more efficient way? Thanks.
df1['mom_pct_zero'] = df1['mom_pct'].shift(12)
df1['mom_pct_pos1'] = df1['mom_pct'].shift(12) + 0.3
df1['mom_pct_pos2'] = df1['mom_pct'].shift(12) + 0.5
df1['mom_pct_neg1'] = df1['mom_pct'].shift(12) - 0.3
df1['mom_pct_neg2'] = df1['mom_pct'].shift(12) - 0.5
I would do it as follows:
def categorize(v):
    if np.isnan(v) or v == 0.:
        return v
    sign = -1 if v < 0 else 1
    eps = 1e-10
    if abs(v) <= 0.3 + eps:
        return sign * 1
    if abs(v) <= 0.5 + eps:
        return sign * 2
    return sign * 3
df['categorial_mom_pct'] = df['mom_pct'].diff(12).map(categorize)
print(df)
Note that I added a very small eps to the thresholds to counter floating-point precision issues:
abs(-0.3) <= 0.3 # True
abs(-0.4 + 0.1) <= 0.3 # False
abs(-0.4 + 0.1) <= 0.3 + 1e-10 # True
Out:
date mom_pct categorial_mom_pct
0 2020-1-31 1.0 NaN
1 2020-2-29 0.8 NaN
2 2020-3-31 -1.2 NaN
3 2020-4-30 -0.9 NaN
4 2020-5-31 -0.8 NaN
5 2020-6-30 -0.1 NaN
6 2020-7-31 0.6 NaN
7 2020-8-31 0.4 NaN
8 2020-9-30 0.2 NaN
9 2020-10-31 -0.3 NaN
10 2020-11-30 -0.6 NaN
11 2020-12-31 0.7 NaN
12 2021-1-31 1.0 0.0
13 2021-2-28 0.6 -1.0
14 2021-3-31 -0.5 3.0
15 2021-4-30 -0.3 3.0
16 2021-5-31 -0.2 3.0
17 2021-6-30 -0.4 -1.0
18 2021-7-31 0.3 -1.0
19 2021-8-31 0.1 -1.0
20 2021-9-30 0.0 -1.0
21 2021-10-31 0.7 3.0
22 2021-11-30 0.4 3.0
23 2021-12-31 -0.3 -3.0
24 2022-1-31 0.4 -3.0
25 2022-2-28 0.6 0.0
26 2022-3-31 0.0 2.0
27 2022-4-30 0.4 3.0
28 2022-5-31 -0.2 0.0
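A fully vectorized alternative is also possible, for example with np.select; this is just a sketch that encodes the same thresholds (and the same eps tolerance) on the 12-period difference:
import numpy as np

d = df['mom_pct'].diff(12)
eps = 1e-10
conditions = [
    d.eq(0),
    d.gt(0) & d.le(0.3 + eps),
    d.gt(0.3 + eps) & d.le(0.5 + eps),
    d.gt(0.5 + eps),
    d.lt(0) & d.ge(-0.3 - eps),
    d.lt(-0.3 - eps) & d.ge(-0.5 - eps),
    d.lt(-0.5 - eps),
]
choices = [0, 1, 2, 3, -1, -2, -3]
# rows where no condition matches (the first 12 NaN differences) keep the default NaN
df['categorial_mom_pct'] = np.select(conditions, choices, default=np.nan)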
We can simply calculate the mean along an axis:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1,1,0,1,0,1,1,0,1,1,1],
                   'b': [1,1,0,1,0,1,1,0,1,1,1],
                   'c': [1,1,0,1,0,1,1,0,1,1,1]})
# max of the per-row means across the three columns
mean = np.max(df.mean(axis=1))
How can I do the same with a rolling mean?
Attempt 1:
# max_of_three columns
mean=df.rolling(2).mean(axis=1)
got this error:
UnsupportedFunctionCall: numpy operations are not valid with window objects. Use .rolling(...).mean() instead
Attempt 2:
def tt(x):
    x = pd.DataFrame(x)
    b1 = np.max(x.mean(axis=1))
    return b1

# max of the row means within each window
mean = df.rolling(2).apply(tt, raw=True)
But this gives me three columns in the result, when there should really be just one value for each moving window.
Where is my mistake? Or is there any other efficient way of doing this?
You can use the axis argument in rolling:
df.rolling(2, axis=0).mean()
>>> A b c
0 NaN NaN NaN
1 1.0 1.0 1.0
2 0.5 0.5 0.5
3 0.5 0.5 0.5
4 0.5 0.5 0.5
5 0.5 0.5 0.5
6 1.0 1.0 1.0
7 0.5 0.5 0.5
8 0.5 0.5 0.5
9 1.0 1.0 1.0
10 1.0 1.0 1.0
r = df.rolling(2, axis=1).mean()
r
>>> A b c
0 NaN 1.0 1.0
1 NaN 1.0 1.0
2 NaN 0.0 0.0
3 NaN 1.0 1.0
4 NaN 0.0 0.0
5 NaN 1.0 1.0
6 NaN 1.0 1.0
7 NaN 0.0 0.0
8 NaN 1.0 1.0
9 NaN 1.0 1.0
10 NaN 1.0 1.0
r.max()
>>> A NaN
b 1.0
c 1.0
dtype: float64
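If the goal from the question is a single value per window, i.e. the maximum of the per-row means inside each 2-row window, one sketch is to take the row means first and then roll over them (this also sidesteps the axis argument, which newer pandas versions deprecate for rolling):
# mean across the three columns for every row, then the max of those
# row means over each window of two consecutive rows
row_means = df.mean(axis=1)
result = row_means.rolling(2).max()
print(result)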
I have some data in a pandas DataFrame that looks like this.
df =
A B
time
0.1 10.0 1
0.15 12.1 2
0.19 4.0 2
0.21 5.0 2
0.22 6.0 2
0.25 7.0 1
0.3 8.1 1
0.4 9.45 2
0.5 3.0 1
Based on the following condition, I am looking for a generic solution to find the first and last index of every consecutive subset.
cond = df.B == 2
So far I have tried the groupby approach, but without the expected result.
df_1 = cond.reset_index()
df_2 = df_1.groupby(df_1['B']).agg(['first','last']).reset_index()
This is the output I got.
B time
first last
0 False 0.1 0.5
1 True 0.15 0.4
This is the output I would like to get:
B time
first last
0 False 0.1 0.1
1 True 0.15 0.22
2 False 0.25 0.3
3 True 0.4 0.4
4 False 0.5 0.5
How can I accomplish this with a more or less generic approach?
Create a helper Series for groups of consecutive values with Series.shift, Series.ne and a cumulative sum via Series.cumsum, then aggregate using a dictionary:
df_1 = df.reset_index()
df_1.B = df_1.B == 2
g = df_1.B.ne(df_1.B.shift()).cumsum()
df_2 = df_1.groupby(g).agg({'B':'first','time': ['first','last']}).reset_index(drop=True)
print (df_2)
B time
first first last
0 False 0.10 0.10
1 True 0.15 0.22
2 False 0.25 0.30
3 True 0.40 0.40
4 False 0.50 0.50
If you want to avoid a MultiIndex, use named aggregations:
df_1 = df.reset_index()
df_1.B = df_1.B == 2
g = df_1.B.ne(df_1.B.shift()).cumsum()
df_2 = df_1.groupby(g).agg(B=('B','first'),
first=('time','first'),
last=('time','last')).reset_index(drop=True)
print (df_2)
B first last
0 False 0.10 0.10
1 True 0.15 0.22
2 False 0.25 0.30
3 True 0.40 0.40
4 False 0.50 0.50
Suppose you had a DataFrame with a number of columns / Series, say five for example. If the fifth column (named 'Updated Col') had some values in addition to NaNs, what would be the best way to fill the NaNs in 'Updated Col' with values from the other columns, based on a preferred column order?
e.g. my dataframe looks something like this:
Date             1    2    3    4    Updated Col
12/03/2017 0:00  0.4                 0.9
12/03/2017 0:10  0.4                 0.1
12/03/2017 0:20  0.4                 0.6
12/03/2017 0:30  0.9       0.7       Nan
12/03/2017 0:40  0.1                 Nan
12/03/2017 0:50  0.6            0.5  Nan
12/03/2017 1:00  0.4       0.3       Nan
12/03/2017 1:10  0.3            0.2  Nan
12/03/2017 1:20  0.9                 0.8
12/03/2017 1:30  0.9                 0.8
12/03/2017 1:40  0.0                 0.9
...and say, for example, I wanted the values from column 3 as a priority, followed by 2, then 1. I would expect the DataFrame to look like this:
                 1    2    3    4    Updated Col
12/03/2017 0:00  0.4                 0.9
12/03/2017 0:10  0.4                 0.1
12/03/2017 0:20  0.4                 0.6
12/03/2017 0:30  0.9       0.7       0.7
12/03/2017 0:40  0.1                 0.1
12/03/2017 0:50  0.6            0.5  0.5
12/03/2017 1:00  0.4       0.3       0.3
12/03/2017 1:10  0.3            0.2  0.2
12/03/2017 1:20  0.9                 0.8
12/03/2017 1:30  0.9                 0.8
12/03/2017 1:40  0.0                 0.9
...values would be taken from the lower-priority columns only if the higher-priority columns were empty / NaN.
What would be the best way to do this?
I've tried numerous np.where attempts but can't work out what the best way would be.
Many thanks in advance.
You can forward fill along the columns with ffill(axis=1) and then select the column:
updated_col = 'Updated Col'
# define columns to check; use [1, 2, 3, 4] if the column names are integers
cols = ['1','2','3','4'] + [updated_col]
print (df[cols].ffill(axis=1))
1 2 3 4 Updated Col
0 0.4 0.4 0.4 0.4 0.9
1 0.4 0.4 0.4 0.4 0.1
2 0.4 0.4 0.4 0.4 0.6
3 0.9 0.9 0.7 0.7 0.7
4 0.1 0.1 0.1 0.1 0.1
5 0.6 0.6 0.6 0.5 0.5
6 0.4 0.4 0.3 0.3 0.3
7 0.3 0.3 0.3 0.2 0.2
8 0.9 0.9 0.9 0.9 0.8
9 0.9 0.9 0.9 0.9 0.8
10 0.0 0.0 0.0 0.0 0.9
df[updated_col] = df[cols].ffill(axis=1)[updated_col]
print (df)
Date 1 2 3 4 Updated Col
0 12/03/2017 0:00 0.4 NaN NaN NaN 0.9
1 12/03/2017 0:10 0.4 NaN NaN NaN 0.1
2 12/03/2017 0:20 0.4 NaN NaN NaN 0.6
3 12/03/2017 0:30 0.9 NaN 0.7 NaN 0.7
4 12/03/2017 0:40 0.1 NaN NaN NaN 0.1
5 12/03/2017 0:50 0.6 NaN NaN 0.5 0.5
6 12/03/2017 1:00 0.4 NaN 0.3 NaN 0.3
7 12/03/2017 1:10 0.3 NaN NaN 0.2 0.2
8 12/03/2017 1:20 0.9 NaN NaN NaN 0.8
9 12/03/2017 1:30 0.9 NaN NaN NaN 0.8
10 12/03/2017 1:40 0.0 NaN NaN NaN 0.9
EDIT:
Thank you shivsn for the comments.
If the DataFrame contains 'Nan' strings (which are not real NaN missing values) or empty strings, it is necessary to replace them first:
updated_col = 'Updated Col'
cols = ['1','2','3','4'] + ['Updated Col']
d = {'Nan':np.nan, '': np.nan}
df = df.replace(d)
df[updated_col] = df[cols].ffill(axis=1)[updated_col]
print (df)
Date 1 2 3 4 Updated Col
0 12/03/2017 0:00 0.4 NaN NaN NaN 0.9
1 12/03/2017 0:10 0.4 NaN NaN NaN 0.1
2 12/03/2017 0:20 0.4 NaN NaN NaN 0.6
3 12/03/2017 0:30 0.9 NaN 0.7 NaN 0.7
4 12/03/2017 0:40 0.1 NaN NaN NaN 0.1
5 12/03/2017 0:50 0.6 NaN NaN 0.5 0.5
6 12/03/2017 1:00 0.4 NaN 0.3 NaN 0.3
7 12/03/2017 1:10 0.3 NaN NaN 0.2 0.2
8 12/03/2017 1:20 0.9 NaN NaN NaN 0.8
9 12/03/2017 1:30 0.9 NaN NaN NaN 0.8
10 12/03/2017 1:40 0.0 NaN NaN NaN 0.9
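One design note: the ffill approach effectively gives priority to the right-most non-empty column before 'Updated Col'. If you need to enforce an explicit priority order (the question mentions 3, then 2, then 1), a sketch, assuming the 'Nan'/empty strings have already been replaced as above, is to list the columns in priority order, back-fill across them, and take the first column of the result:
priority = ['3', '2', '1']  # highest priority first; add '4' if it should be considered as well
fallback = df[priority].bfill(axis=1).iloc[:, 0]
df[updated_col] = df[updated_col].fillna(fallback)
print(df)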