I am facing a small problem whose solution is certainly very simple, but I cannot figure out how to do it.
Let's say I have the following pandas dataframe df:
import pandas as pd
X = [0.78, 0.82, 1.03, 1.06, 1.21]
Y = [0.0, 0.2521, 0.4905, 0.5003, 1.0]
df = pd.DataFrame({'X':X, 'Y':Y})
df
X Y
0 0.78 0.0000
1 0.82 0.2521
2 1.03 0.4905
3 1.06 0.5003
4 1.21 1.0000
I want to recover the value of X at which Y first exceeds 0.5; in other words, I am looking for a piece of code which creates a new variable val such that:
print (val)
1.06
I can only come up with complicated approaches, like:
df['Z'] = df.apply(lambda row: 0 if row.Y <= 0.5 else 1, axis = 1)
df
X Y Z
0 0.78 0.0000 0
1 0.82 0.2521 0
2 1.03 0.4905 0
3 1.06 0.5003 1
4 1.21 1.0000 1
But this only shows me where the X value I want is (the first appearance of 1 in Z); it doesn't extract that value.
How could I do that in a simple way?
We can check with idxmax; note that it needs at least one value greater than 0.5, otherwise idxmax would just return the first index:
df.loc[df.Y.gt(0.5).idxmax(),'Z']=1
df.Z.fillna(0,inplace=True)
df
X Y Z
0 0.78 0.0000 0.0
1 0.82 0.2521 0.0
2 1.03 0.4905 0.0
3 1.06 0.5003 1.0
4 1.21 1.0000 0.0
If you would like a separate DataFrame:
df1=df.loc[df.Y.gt(0.5)]
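And if only the scalar val from the question is needed, a minimal sketch (assuming at least one Y value exceeds 0.5) is to filter on the mask and take the first X:
val = df.loc[df.Y.gt(0.5), 'X'].iloc[0]
print (val)
1.06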
I have a pandas DataFrame (df) and a list (df_values).
I want to make another DataFrame which contains the distribution/percentage by which each data point in df belongs to the values in the list df_values.
DataFrame df is:
A
0 100
1 300
2 150
List df_values (set of referential values) is:
df_values = [[0,200,400,600]]
Desired DataFrame:
      0     200  400  600
0  0.50  0.50  0.0    0
1  0.00  0.50  0.5    0
2  0.25  0.75  0.0    0
Here the number 100 in df is 0.50 towards 0 and 0.50 towards 200 in df_values. Similarly, 300 in df is 0.50 towards 200 and 0.50 towards 400 in df_values, and so on.
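One possible sketch of an approach, assuming the reference values in df_values are sorted and every data point in df lies inside their range: np.searchsorted finds the surrounding pair of references, and the weights are assigned by linear interpolation (the loop and variable names here are illustrative, not from the original post):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [100, 300, 150]})
df_values = [[0, 200, 400, 600]]

refs = np.asarray(df_values[0], dtype=float)
vals = df['A'].to_numpy(dtype=float)
idx = np.searchsorted(refs, vals, side='left')   # first reference >= value

out = pd.DataFrame(0.0, index=df.index, columns=df_values[0])
for row, (v, i) in enumerate(zip(vals, idx)):
    if refs[i] == v:
        out.iloc[row, i] = 1.0                   # exact hit on a reference value
    else:
        lo, hi = refs[i - 1], refs[i]            # surrounding reference pair
        out.iloc[row, i - 1] = (hi - v) / (hi - lo)
        out.iloc[row, i] = (v - lo) / (hi - lo)
print(out)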
I've got a pandas DataFrame like this, containing a timestamp, id, foo and bar.
The timestamp data is roughly every 10 minutes.
timestamp id foo bar
2019-04-14 00:00:10 1 0.10 0.05
2019-04-14 00:10:02 1 0.30 0.10
2019-04-14 00:00:00 2 0.10 0.05
2019-04-14 00:10:00 2 0.30 0.10
For each id, I'd like to create 4 additional rows, with timestamps split equally between successive rows (so each id ends up with six rows in total), and foo & bar containing random values between those of the successive rows.
The start time should be the earliest timestamp for each id and the end time should be the latest timestamp for each id.
So the output would be like this.
timestamp id foo bar
2019-04-14 00:00:10 1 0.10 0.05
2019-04-14 00:02:10 1 0.14 0.06
2019-04-14 00:04:10 1 0.11 0.06
2019-04-14 00:06:10 1 0.29 0.07
2019-04-14 00:08:10 1 0.22 0.09
2019-04-14 00:10:02 1 0.30 0.10
2019-04-14 00:00:00 2 0.80 0.50
2019-04-14 00:02:00 2 0.45 0.48
2019-04-14 00:04:00 2 0.52 0.42
2019-04-14 00:06:00 2 0.74 0.48
2019-04-14 00:08:00 2 0.41 0.45
2019-04-14 00:10:00 2 0.40 0.40
I can reindex the timestamp column and create additional timestamp rows (e.g. Pandas create new date rows and forward fill column values).
But I can't seem to wrap my head around how to compute the random values for foo and bar between the successive rows.
I'd appreciate it if someone could point me in the right direction!
Close to what you need is to use date_range with DataFrame.reindex by the first and last values of the DatetimeIndex:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = (df.set_index('timestamp')
        .groupby('id')[['foo','bar']]
        .apply(lambda x: x.reindex(pd.date_range(x.index[0], x.index[-1], periods=6))))
Then create a helper DataFrame of the same shape as the original and fill the missing values with DataFrame.fillna:
import numpy as np

df1 = pd.DataFrame(np.random.rand(*df.shape), index=df.index, columns=df.columns)
df = df.fillna(df1)
print (df)
foo bar
id
1 2019-04-14 00:00:10.000 0.100000 0.050000
2019-04-14 00:02:08.400 0.903435 0.755841
2019-04-14 00:04:06.800 0.956002 0.253878
2019-04-14 00:06:05.200 0.388454 0.257639
2019-04-14 00:08:03.600 0.225535 0.195306
2019-04-14 00:10:02.000 0.300000 0.100000
2 2019-04-14 00:00:00.000 0.100000 0.050000
2019-04-14 00:02:00.000 0.180865 0.327581
2019-04-14 00:04:00.000 0.417956 0.414400
2019-04-14 00:06:00.000 0.012686 0.800948
2019-04-14 00:08:00.000 0.716216 0.941396
2019-04-14 00:10:00.000 0.300000 0.100000
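If id and timestamp should end up as regular columns again, as in the desired output, one possible follow-up is to name the index levels and reset them (a sketch; the datetime level is unnamed after the apply):
df = df.rename_axis(['id', 'timestamp']).reset_index()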
If the 'randomness' is not crucial, we can use Series.interpolate, which keeps the values between your min and max per group:
df_new = pd.concat([
    d.reindex(pd.date_range(d.timestamp.min(), d.timestamp.max(), periods=6))
    for _, d in df.groupby('id')
])
df_new['timestamp'] = df_new.index
df_new.reset_index(drop=True, inplace=True)
df_new = df_new[['timestamp']].merge(df, on='timestamp', how='left')
df_new['id'].fillna(method='ffill', inplace=True)
df_new[['foo', 'bar']] = df_new[['foo', 'bar']].apply(lambda x: x.interpolate())
Which gives the following output:
print(df_new)
timestamp id foo bar
0 2019-04-14 00:00:10.000 1.0 0.10 0.05
1 2019-04-14 00:02:08.400 1.0 0.14 0.06
2 2019-04-14 00:04:06.800 1.0 0.18 0.07
3 2019-04-14 00:06:05.200 1.0 0.22 0.08
4 2019-04-14 00:08:03.600 1.0 0.26 0.09
5 2019-04-14 00:10:02.000 1.0 0.30 0.10
6 2019-04-14 00:00:00.000 2.0 0.10 0.05
7 2019-04-14 00:02:00.000 2.0 0.14 0.06
8 2019-04-14 00:04:00.000 2.0 0.18 0.07
9 2019-04-14 00:06:00.000 2.0 0.22 0.08
10 2019-04-14 00:08:00.000 2.0 0.26 0.09
11 2019-04-14 00:10:00.000 2.0 0.30 0.10
I'm working with dynamic .csv files, so I never know in advance what the column names will be. Examples of those:
1)
ETC META A B C D E %
0 2.0 A 0.0 24.564 0.000 0.0 0.0 -0.00%
1 4.2 B 0.0 2.150 0.000 0.0 0.0 3.55%
2 5.0 C 0.0 0.000 15.226 0.0 0.0 6.14%
2)
META A C D E %
0 A 0.00 0.00 2.90 0.0 -0.00%
1 B 3.00 0.00 0.00 0.0 3.55%
2 C 0.00 21.56 0.00 0.0 6.14%
3)
FILL ETC META G F %
0 T 2.0 A 0.00 6.70 -0.00%
1 F 4.2 B 2.90 0.00 3.55%
2 T 5.0 C 0.00 34.53 6.14%
Since I would like to create a new column with the SUM of all columns between META and %, I need to get the names of all those columns, so I can create something like this:
a = df['Total'] = df['A'] + df['B'] + df['C'] + df['D'] + df['E']
As the column names change, the code above will only work for example 1). So I need to: 1) identify all the columns; and 2) sum them.
The solution has to work for the 3 examples above (1, 2 and 3).
Note that the only certainty is that the columns to sum are between META and %, but even those columns are not fixed.
Select all columns except the first and last with DataFrame.iloc and then sum:
df['Total'] = df.iloc[:, 1:-1].sum(axis=1)
Or remove the META and % columns with DataFrame.drop before summing:
df['Total'] = df.drop(['META','%'], axis=1).sum(axis=1)
print (df)
META A B C D E % Total
0 A 0.0 24.564 0.000 0.0 0.0 -0.00% 24.564
1 B 0.0 2.150 0.000 0.0 0.0 3.55% 2.150
2 C 0.0 0.000 15.226 0.0 0.0 6.14% 15.226
EDIT: You can select columns between META and %:
# META and % are not numeric, so sum() skips them
df['Total'] = df.loc[:, 'META':'%'].sum(axis=1)

# META is not numeric; iloc already excludes the % column
df['Total'] = df.iloc[:, df.columns.get_loc('META'):df.columns.get_loc('%')].sum(axis=1)

# most general: only the columns strictly between META and %
df['Total'] = df.iloc[:, df.columns.get_loc('META')+1:df.columns.get_loc('%')].sum(axis=1)
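As a quick sanity check, applying the last, most general variant to the layout of example 3 (df3 below is just a hypothetical reconstruction of that sample) sums only the G and F columns:
df3 = pd.DataFrame({'FILL': ['T', 'F', 'T'],
                    'ETC': [2.0, 4.2, 5.0],
                    'META': ['A', 'B', 'C'],
                    'G': [0.00, 2.90, 0.00],
                    'F': [6.70, 0.00, 34.53],
                    '%': ['-0.00%', '3.55%', '6.14%']})
df3['Total'] = df3.iloc[:, df3.columns.get_loc('META')+1:df3.columns.get_loc('%')].sum(axis=1)
print (df3['Total'])
0     6.70
1     2.90
2    34.53
Name: Total, dtype: float64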
Python version: 3.6
Pandas version: 0.21.1
How do I get from
print(df_raw)
device_id temp_a temp_b temp_c
0 0 0.2 0.8 0.6
1 0 0.1 0.9 0.4
2 1 0.3 0.7 0.2
3 2 0.5 0.5 0.1
4 2 0.1 0.9 0.4
5 2 0.7 0.3 0.9
to
print(df_except2)
device_id temp_a temp_b temp_c temp_a_1 temp_b_1 temp_c_1 temp_a_2 \
0 0 0.2 0.8 0.6 0.1 0.9 0.4 NaN
1 1 0.3 0.7 0.2 NaN NaN NaN NaN
2 2 0.5 0.5 0.1 0.1 0.9 0.4 0.7
temp_b_2 temp_c_2
0 NaN NaN
1 NaN NaN
2 0.3 0.9
Code of data:
df_raw = pd.DataFrame({'device_id' : ['0','0','1','2','2','2'],
                       'temp_a' : [0.2,0.1,0.3,0.5,0.1,0.7],
                       'temp_b' : [0.8,0.9,0.7,0.5,0.9,0.3],
                       'temp_c' : [0.6,0.4,0.2,0.1,0.4,0.9],
                       })
print(df_raw)
df_except = pd.DataFrame({'device_id' : ['0','1','2'],
                          'temp_a':[0.2,0.3,0.5],
                          'temp_b':[0.8,0.7,0.5],
                          'temp_c':[0.6,0.2,0.1],
                          'temp_a_1':[0.1,None,0.1],
                          'temp_b_1':[0.9,None,0.9],
                          'temp_c_1':[0.4,None,0.4],
                          'temp_a_2':[None,None,0.7],
                          'temp_b_2':[None,None,0.3],
                          'temp_c_2':[None,None,0.9],
                          })
df_except2 = df_except[['device_id','temp_a','temp_b','temp_c','temp_a_1','temp_b_1','temp_c_1','temp_a_2','temp_b_2','temp_c_2']]
print(df_except2)
Note:
1. The number of rows per device_id is unknown.
2. I referred to the following answer: Pandas Dataframe - How to combine multiple rows to one. But that answer can only deal with one column.
Use:
g = df_raw.groupby('device_id').cumcount()
df = df_raw.set_index(['device_id', g]).unstack().sort_index(axis=1, level=1)
df.columns = ['{}_{}'.format(i,j) if j != 0 else '{}'.format(i) for i, j in df.columns]
df = df.reset_index()
print (df)
device_id temp_a temp_b temp_c temp_a_1 temp_b_1 temp_c_1 temp_a_2 \
0 0 0.2 0.8 0.6 0.1 0.9 0.4 NaN
1 1 0.3 0.7 0.2 NaN NaN NaN NaN
2 2 0.5 0.5 0.1 0.1 0.9 0.4 0.7
temp_b_2 temp_c_2
0 NaN NaN
1 NaN NaN
2 0.3 0.9
Explanation:
First count the groups with cumcount by column device_id (see the sketch after this list)
Create a MultiIndex with set_index and the Series g
Reshape with unstack
Sort the second level of the MultiIndex in columns with sort_index
Change the column names with a list comprehension
Last, reset_index to turn device_id from the index back into a column
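For illustration, on the sample data the helper counter g is simply the per-device row position (a sketch of the intermediate step):
g = df_raw.groupby('device_id').cumcount()
print (g.tolist())
[0, 1, 0, 0, 1, 2]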
code:
import numpy as np

device_id_list = df_raw['device_id'].tolist()
device_id_list = list(np.unique(device_id_list))

append_df = pd.DataFrame()
for device_id in device_id_list:
    # select all rows for this device
    tmp_df = df_raw.query('device_id=="%s"' % (device_id))
    if len(tmp_df) > 1:
        # split the device's rows into single-row DataFrames
        one_raw_list = []
        for i in range(0, len(tmp_df)):
            one_raw_df = tmp_df.iloc[i:i+1]
            one_raw_list.append(one_raw_df)
        # concatenate them side by side, renaming the columns of every row after the first
        tmp_combine_df = pd.DataFrame()
        for i in range(0, len(one_raw_list)-1):
            next_raw = one_raw_list[i+1].drop(columns=['device_id']).reset_index(drop=True)
            new_name_list = []
            for old_name in list(next_raw.columns):
                new_name_list.append(old_name + '_' + str(i+1))
            next_raw.columns = new_name_list
            if i == 0:
                current_raw = one_raw_list[i].reset_index(drop=True)
                tmp_combine_df = pd.concat([current_raw, next_raw], axis=1)
            else:
                tmp_combine_df = pd.concat([tmp_combine_df, next_raw], axis=1)
        tmp_df = tmp_combine_df
    # append this device's (now single) row, keeping the widest column set
    tmp_df_columns = tmp_df.columns
    append_df_columns = append_df.columns
    append_df = pd.concat([append_df, tmp_df], ignore_index=True)
    if len(tmp_df_columns) > len(append_df_columns):
        append_df = append_df[tmp_df_columns]
    else:
        append_df = append_df[append_df_columns]

print(append_df)
Output:
device_id temp_a temp_b temp_c temp_a_1 temp_b_1 temp_c_1 temp_a_2 \
0 0 0.2 0.8 0.6 0.1 0.9 0.4 NaN
1 1 0.3 0.7 0.2 NaN NaN NaN NaN
2 2 0.5 0.5 0.1 0.1 0.9 0.4 0.7
temp_b_2 temp_c_2
0 NaN NaN
1 NaN NaN
2 0.3 0.9
df = pd.DataFrame({'device_id' : ['0','0','1','2','2','2'],
                   'temp_a' : [0.2,0.1,0.3,0.5,0.1,0.7],
                   'temp_b' : [0.8,0.9,0.7,0.5,0.9,0.3],
                   'temp_c' : [0.6,0.4,0.2,0.1,0.4,0.9],
                   })
cols_of_interest = df.columns.drop('device_id')
df["C"] = "C_" + (df.groupby("device_id").cumcount() + 1).astype(str)
df.pivot_table(index="device_id", values=cols_of_interest, columns="C")
Output:
temp_a temp_b temp_c
C C_1 C_2 C_3 C_1 C_2 C_3 C_1 C_2 C_3
device_id
0 0.2 0.1 NaN 0.8 0.9 NaN 0.6 0.4 NaN
1 0.3 NaN NaN 0.7 NaN NaN 0.2 NaN NaN
2 0.5 0.1 0.7 0.5 0.9 0.3 0.1 0.4 0.9
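If the flat column names of df_except2 are needed, the MultiIndex columns produced by pivot_table could be collapsed afterwards; a possible sketch (note the resulting order groups temp_a, temp_a_1, temp_a_2 together rather than interleaving the measures):
res = df.pivot_table(index="device_id", values=cols_of_interest, columns="C")
# collapse ('temp_a', 'C_1') -> 'temp_a', ('temp_a', 'C_2') -> 'temp_a_1', ...
res.columns = [name if cnt == "C_1" else "{}_{}".format(name, int(cnt.split("_")[1]) - 1)
               for name, cnt in res.columns]
res = res.reset_index()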