Update single column based on multiple 'priority' columns - python-3.x
Suppose you had a DataFrame with a number of columns / Series, say five for example. If the fifth column (named 'Updated Col') had values in addition to NaNs, what would be the best way to fill the NaNs in 'Updated Col' with values from the other columns, based on a preferred column order?
e.g. my DataFrame looks something like this:
Date             1    2    3    4    Updated Col
12/03/2017 0:00  0.4                 0.9
12/03/2017 0:10  0.4                 0.1
12/03/2017 0:20  0.4                 0.6
12/03/2017 0:30  0.9       0.7       Nan
12/03/2017 0:40  0.1                 Nan
12/03/2017 0:50  0.6            0.5  Nan
12/03/2017 1:00  0.4       0.3       Nan
12/03/2017 1:10  0.3            0.2  Nan
12/03/2017 1:20  0.9                 0.8
12/03/2017 1:30  0.9                 0.8
12/03/2017 1:40  0.0                 0.9
...and say, for example, I wanted the values from column 3 as a priority, followed by 2, then 1, I would expect the DataFrame to look like this:
Date             1    2    3    4    Updated Col
12/03/2017 0:00  0.4                 0.9
12/03/2017 0:10  0.4                 0.1
12/03/2017 0:20  0.4                 0.6
12/03/2017 0:30  0.9       0.7       0.7
12/03/2017 0:40  0.1                 0.1
12/03/2017 0:50  0.6            0.5  0.5
12/03/2017 1:00  0.4       0.3       0.3
12/03/2017 1:10  0.3            0.2  0.2
12/03/2017 1:20  0.9                 0.8
12/03/2017 1:30  0.9                 0.8
12/03/2017 1:40  0.0                 0.9
...values would be taken from the lower-priority columns only if the higher-priority columns were empty / NaN.
What would be the best way to do this?
I've tried numerous np.where attempts but can't work out what the best way would be.
Many thanks in advance.
You can use forward filling (ffill) along the columns (axis=1) and then select the last column:
updated_col = 'Updated Col'
#define columns to check, maybe [1,2,3,4] if the column names are integers
cols = ['1','2','3','4'] + [updated_col]
print (df[cols].ffill(axis=1))
      1    2    3    4  Updated Col
0   0.4  0.4  0.4  0.4          0.9
1   0.4  0.4  0.4  0.4          0.1
2   0.4  0.4  0.4  0.4          0.6
3   0.9  0.9  0.7  0.7          0.7
4   0.1  0.1  0.1  0.1          0.1
5   0.6  0.6  0.6  0.5          0.5
6   0.4  0.4  0.3  0.3          0.3
7   0.3  0.3  0.3  0.2          0.2
8   0.9  0.9  0.9  0.9          0.8
9   0.9  0.9  0.9  0.9          0.8
10  0.0  0.0  0.0  0.0          0.9
df[updated_col] = df[cols].ffill(axis=1)[updated_col]
print (df)
               Date    1    2    3    4  Updated Col
0   12/03/2017 0:00  0.4  NaN  NaN  NaN          0.9
1   12/03/2017 0:10  0.4  NaN  NaN  NaN          0.1
2   12/03/2017 0:20  0.4  NaN  NaN  NaN          0.6
3   12/03/2017 0:30  0.9  NaN  0.7  NaN          0.7
4   12/03/2017 0:40  0.1  NaN  NaN  NaN          0.1
5   12/03/2017 0:50  0.6  NaN  NaN  0.5          0.5
6   12/03/2017 1:00  0.4  NaN  0.3  NaN          0.3
7   12/03/2017 1:10  0.3  NaN  NaN  0.2          0.2
8   12/03/2017 1:20  0.9  NaN  NaN  NaN          0.8
9   12/03/2017 1:30  0.9  NaN  NaN  NaN          0.8
10  12/03/2017 1:40  0.0  NaN  NaN  NaN          0.9
EDIT:
Thanks to shivsn for the comment.
If the DataFrame contains 'Nan' strings (which are not real NaN missing values) or empty strings, it is necessary to replace them first:
import numpy as np

updated_col = 'Updated Col'
cols = ['1','2','3','4'] + [updated_col]
#replace 'Nan' strings and empty strings with real missing values
d = {'Nan':np.nan, '': np.nan}
df = df.replace(d)
df[updated_col] = df[cols].ffill(axis=1)[updated_col]
print (df)
               Date    1    2    3    4  Updated Col
0   12/03/2017 0:00  0.4  NaN  NaN  NaN          0.9
1   12/03/2017 0:10  0.4  NaN  NaN  NaN          0.1
2   12/03/2017 0:20  0.4  NaN  NaN  NaN          0.6
3   12/03/2017 0:30  0.9  NaN  0.7  NaN          0.7
4   12/03/2017 0:40  0.1  NaN  NaN  NaN          0.1
5   12/03/2017 0:50  0.6  NaN  NaN  0.5          0.5
6   12/03/2017 1:00  0.4  NaN  0.3  NaN          0.3
7   12/03/2017 1:10  0.3  NaN  NaN  0.2          0.2
8   12/03/2017 1:20  0.9  NaN  NaN  NaN          0.8
9   12/03/2017 1:30  0.9  NaN  NaN  NaN          0.8
10  12/03/2017 1:40  0.0  NaN  NaN  NaN          0.9
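Note: ffill(axis=1) makes the rightmost non-missing value win, so the order of cols encodes the priority. If you want an explicit priority order instead, such as column 3 first, then 2, then 1 as the question asks, a possible alternative (a sketch, not part of the original answer) is to chain fillna calls over the columns in priority order:

updated_col = 'Updated Col'
priority = ['3', '2', '1']  #highest priority first; add '4' if it should participate

out = df[updated_col].copy()
for c in priority:
    #fill the remaining gaps from the next column in the priority list
    out = out.fillna(df[c])
df[updated_col] = out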
Related
Comparing one month's value of the current year with the previous year's values, adding or subtracting multiple parameters
Given the following dataframe df:

    date        mom_pct
0   2020-1-31       1.4
1   2020-2-29       0.8
2   2020-3-31      -1.2
3   2020-4-30      -0.9
4   2020-5-31      -0.8
5   2020-6-30      -0.1
6   2020-7-31       0.6
7   2020-8-31       0.4
8   2020-9-30       0.2
9   2020-10-31     -0.3
10  2020-11-30     -0.6
11  2020-12-31      0.7
12  2021-1-31       1.0
13  2021-2-28       0.6
14  2021-3-31      -0.5
15  2021-4-30      -0.3
16  2021-5-31      -0.2
17  2021-6-30      -0.4
18  2021-7-31       0.3
19  2021-8-31       0.1
20  2021-9-30       0.0
21  2021-10-31      0.7
22  2021-11-30      0.4
23  2021-12-31     -0.3
24  2022-1-31       0.4
25  2022-2-28       0.6
26  2022-3-31       0.0
27  2022-4-30       0.4
28  2022-5-31      -0.2

I want to compare the chain ratio value of a month of the current year with the value of the same month of the previous year. Assume that the value of the same period last year is y_t-1 and the current value of this year is y_t. I will create a new column according to the following rules:

If y_t = y_t-1, return 0 for the new column;
If y_t ∈ (y_t-1, y_t-1 + 0.3], return 1;
If y_t ∈ (y_t-1 + 0.3, y_t-1 + 0.5], return 2;
If y_t > y_t-1 + 0.5, return 3;
If y_t ∈ [y_t-1 - 0.3, y_t-1), return -1;
If y_t ∈ [y_t-1 - 0.5, y_t-1 - 0.3), return -2;
If y_t < y_t-1 - 0.5, return -3.

The expected result:

    date        mom_pct  categorial_mom_pct
0   2020-1-31       1.0                 NaN
1   2020-2-29       0.8                 NaN
2   2020-3-31      -1.2                 NaN
3   2020-4-30      -0.9                 NaN
4   2020-5-31      -0.8                 NaN
5   2020-6-30      -0.1                 NaN
6   2020-7-31       0.6                 NaN
7   2020-8-31       0.4                 NaN
8   2020-9-30       0.2                 NaN
9   2020-10-31     -0.3                 NaN
10  2020-11-30     -0.6                 NaN
11  2020-12-31      0.7                 NaN
12  2021-1-31       1.0                 0.0
13  2021-2-28       0.6                -1.0
14  2021-3-31      -0.5                 3.0
15  2021-4-30      -0.3                 3.0
16  2021-5-31      -0.2                 3.0
17  2021-6-30      -0.4                -1.0
18  2021-7-31       0.3                -1.0
19  2021-8-31       0.1                -1.0
20  2021-9-30       0.0                -1.0
21  2021-10-31      0.7                 3.0
22  2021-11-30      0.4                 3.0
23  2021-12-31     -0.3                -3.0
24  2022-1-31       0.4                -3.0
25  2022-2-28       0.6                 0.0
26  2022-3-31       0.0                 2.0
27  2022-4-30       0.4                 3.0
28  2022-5-31      -0.2                 0.0

I attempted to create a column of shifted thresholds for each range and then check which range mom_pct falls into. Is it possible to do this in a more efficient way? Thanks.

df1['mom_pct_zero'] = df1['mom_pct'].shift(12)
df1['mom_pct_pos1'] = df1['mom_pct'].shift(12) + 0.3
df1['mom_pct_pos2'] = df1['mom_pct'].shift(12) + 0.5
df1['mom_pct_neg1'] = df1['mom_pct'].shift(12) - 0.3
df1['mom_pct_neg2'] = df1['mom_pct'].shift(12) - 0.5
I would do it as follows:

import numpy as np

def categorize(v):
    if np.isnan(v) or v == 0.:
        return v
    sign = -1 if v < 0 else 1
    eps = 1e-10
    if abs(v) <= 0.3 + eps:
        return sign * 1
    if abs(v) <= 0.5 + eps:
        return sign * 2
    return sign * 3

df['categorial_mom_pct'] = df['mom_pct'].diff(12).map(categorize)
print(df)

Note that I added a very small eps to the thresholds to counter the precision issue of floating point arithmetic:

abs(-0.3) <= 0.3                # True
abs(-0.4 + 0.1) <= 0.3          # False
abs(-0.4 + 0.1) <= 0.3 + 1e-10  # True

Out:

    date        mom_pct  categorial_mom_pct
0   2020-1-31       1.0                 NaN
1   2020-2-29       0.8                 NaN
2   2020-3-31      -1.2                 NaN
3   2020-4-30      -0.9                 NaN
4   2020-5-31      -0.8                 NaN
5   2020-6-30      -0.1                 NaN
6   2020-7-31       0.6                 NaN
7   2020-8-31       0.4                 NaN
8   2020-9-30       0.2                 NaN
9   2020-10-31     -0.3                 NaN
10  2020-11-30     -0.6                 NaN
11  2020-12-31      0.7                 NaN
12  2021-1-31       1.0                 0.0
13  2021-2-28       0.6                -1.0
14  2021-3-31      -0.5                 3.0
15  2021-4-30      -0.3                 3.0
16  2021-5-31      -0.2                 3.0
17  2021-6-30      -0.4                -1.0
18  2021-7-31       0.3                -1.0
19  2021-8-31       0.1                -1.0
20  2021-9-30       0.0                -1.0
21  2021-10-31      0.7                 3.0
22  2021-11-30      0.4                 3.0
23  2021-12-31     -0.3                -3.0
24  2022-1-31       0.4                -3.0
25  2022-2-28       0.6                 0.0
26  2022-3-31       0.0                 2.0
27  2022-4-30       0.4                 3.0
28  2022-5-31      -0.2                 0.0
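A possible vectorized alternative (a sketch, not from the original answer) expresses the same thresholds with np.select on the 12-period diff and restores the sign with np.sign; the NaNs in the first 12 rows survive because NaN propagates through the multiplication:

import numpy as np

d = df['mom_pct'].diff(12)
eps = 1e-10  #same floating point guard as above

#band 0 for an exact tie, then 1/2/3 for growing absolute differences
magnitude = np.select(
    [d == 0, d.abs() <= 0.3 + eps, d.abs() <= 0.5 + eps],
    [0, 1, 2],
    default=3,
)
df['categorial_mom_pct'] = np.sign(d) * magnitude  #NaN * anything stays NaN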
Creating multiple cohorts from a pivot table
I have a requirement like below. The initial information is a list of gross adds:

201910  201911  201912  202001  202002
 20000   30000   32000   40000   36000

I have a pivot table as below:

201910  201911  201912  202001  202002
  1000    2000    2400    3200    1800
   500     400     300     200     nan
   200     150     100     nan     nan
   200     100     nan     nan     nan
   160     nan     nan     nan     nan

I need to generate a report like below:

Cohort01: 5%  3%  3%  1%  1%  1%

From Cohort02 onwards, the missing value takes the average of the last value of Cohort01. Similarly, for Cohort03, both nan values take the average of the corresponding values of Cohort01 and Cohort02. Again, while calculating Cohort04, it takes the average of the previous two cohorts' values (Cohort02 and Cohort03) to fill all three nan values. Can anyone provide a solution for this in Python? The report should be generated as above, and all cohorts should be created separately.
You could try it like this:

res = df.apply(lambda x: round(100/(df_gross.iloc[0]/x),1), axis=1)
print(res)

   201910  201911  201912  202001  202002
0     5.0     6.7     7.5     8.0     5.0
1     2.5     1.3     0.9     0.5     NaN
2     1.0     0.5     0.3     NaN     NaN
3     1.0     0.3     NaN     NaN     NaN
4     0.8     NaN     NaN     NaN     NaN

for idx,col in enumerate(res.columns[1:],1):
    res[col] = res[col].fillna((res.iloc[:,max(idx-2,0)]+res.iloc[:,idx-1])/2)
print(res)

   201910  201911  201912  202001  202002
0     5.0     6.7    7.50   8.000  5.0000
1     2.5     1.3    0.90   0.500  0.7000
2     1.0     0.5    0.30   0.400  0.3500
3     1.0     0.3    0.65   0.475  0.5625
4     0.8     0.8    0.80   0.800  0.8000
Extract one column into multiple columns from a CSV file
My credit_scoring.csv looks like this. How can I reorganize it so that there are 14 columns, each holding its corresponding value?

  Seniority;Home;Time;Age;Marital;Records;Job;Expenses;Income;Assets;Debt;Amount;Price;Status
0 9.0;1.0;60.0;30.0;0.0;1.0;1.0;73.0;129.0;0.0;0...
1 17.0;1.0;60.0;58.0;1.0;1.0;0.0;48.0;131.0;0.0;...
2 10.0;0.0;36.0;46.0;0.0;2.0;1.0;90.0;200.0;3000...
3 0.0;1.0;60.0;24.0;1.0;1.0;0.0;63.0;182.0;2500....
4 0.0;1.0;36.0;26.0;1.0;1.0;0.0;46.0;107.0;0.0;0...
. .................................................
. .................................................
. .................................................
. .................................................
You can simply use read_csv() with sep=';'. Your example data isn't great, but I tried to make the most of it. I saved it as a.csv, and here is the code:

In [1]: import pandas as pd

In [2]: pd.read_csv('a.csv', sep=';')
Out[2]:
   Seniority  Home  Time   Age  Marital  Records  Job  Expenses  Income  Assets  Debt  Amount  Price  Status
0        9.0   1.0  60.0  30.0      0.0      1.0  1.0      73.0   129.0     0.0   0.0     NaN    NaN     NaN
1       17.0   1.0  60.0  58.0      1.0      1.0  0.0      48.0   131.0     0.0   NaN     NaN    NaN     NaN
2       10.0   0.0  36.0  46.0      0.0      2.0  1.0      90.0   200.0  3000.0   NaN     NaN    NaN     NaN
3        0.0   1.0  60.0  24.0      1.0      1.0  0.0      63.0   182.0  2500.0   NaN     NaN    NaN     NaN
4        0.0   1.0  36.0  26.0      1.0      1.0  0.0      46.0   107.0     0.0   0.0     NaN    NaN     NaN
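If the goal is a reorganized file on disk rather than just a DataFrame, a possible follow-up (the output filename below is made up for illustration) is to write the parsed table back out with a comma separator:

import pandas as pd

df = pd.read_csv('credit_scoring.csv', sep=';')
df.to_csv('credit_scoring_fixed.csv', index=False)  #hypothetical output name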
Issue with datetime formatting
I am having an issue with the datetime format of a set of data. The issue is that the hour of day ranges from 1-24, with the 24th hour assigned to the wrong day (more specifically, the previous day). A sample of the data is below:

1/1/2019,14:00,0.2,0.1,0.0,0.2,3.0,36.7,3,153
1/1/2019,15:00,0.2,0.6,0.2,0.4,3.9,36.7,1,199
1/1/2019,16:00,1.8,2.4,0.8,1.6,1.1,33.0,0,307
1/1/2019,17:00,3.0,3.2,0.6,2.6,6.0,32.8,1,310
1/1/2019,18:00,1.6,2.2,0.5,1.7,7.9,33.1,4,293
1/1/2019,19:00,1.7,1.1,0.6,0.6,5.9,35.0,5,262
1/1/2019,20:00,1.0,0.5,0.2,0.2,2.9,32.6,5,201
1/1/2019,21:00,0.6,0.3,0.0,0.4,2.1,31.8,6,182
1/1/2019,22:00,0.4,0.3,0.0,0.4,5.1,31.4,6,187
1/1/2019,23:00,0.8,0.6,0.3,0.3,9.9,30.2,5,227
1/1/2019,24:00,1.0,0.7,0.3,0.4,6.9,27.9,4,225   --- here the date should be 1/2/2019
1/2/2019,01:00,1.3,0.9,0.5,0.4,4.0,26.9,6,236
1/2/2019,02:00,0.4,0.4,0.2,0.2,5.0,27.3,6,168
1/2/2019,03:00,0.7,0.5,0.3,0.3,6.9,30.2,4,219
1/2/2019,04:00,1.3,0.8,0.5,0.3,5.9,32.3,4,242
1/2/2019,05:00,0.7,0.2,0.0,0.2,3.0,33.8,4,177
1/2/2019,06:00,0.5,0.2,0.2,0.1,5.1,36.1,4,195
1/2/2019,07:00,0.6,0.3,0.2,0.2,9.9,38.0,4,200
1/2/2019,08:00,0.5,0.6,0.4,0.3,6.8,38.9,4,179
1/2/2019,09:00,0.5,0.2,0.0,0.2,3.0,39.0,4,193
1/2/2019,10:00,0.3,0.3,0.2,0.1,4.0,38.7,5,198
1/2/2019,11:00,0.3,0.3,0.2,0.0,4.9,38.4,5,170
1/2/2019,12:00,0.6,0.3,0.3,0.0,2.0,38.4,4,172
1/2/2019,13:00,0.2,0.3,0.2,0.0,2.0,38.8,4,154
1/2/2019,14:00,0.3,0.1,0.0,0.2,1.9,39.3,4,145

This is a fairly large set of data which I need to make a time series plot of, so I need a way to fix this formatting issue. I was attempting to iterate through the rows of a pandas dataframe to fix the problematic rows, but this did not produce any results. Thank you for any help beforehand.
You can convert date to datetimes with to_datetime and then add the time column converted to timedeltas with to_timedelta:

df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(df['time'] + ':00')

Or, if you also need to remove the time column:

print (df)
        date   time    a    b    c    d    e     f  g    h
0   1/1/2019  14:00  0.2  0.1  0.0  0.2  3.0  36.7  3  153
1   1/1/2019  15:00  0.2  0.6  0.2  0.4  3.9  36.7  1  199
2   1/1/2019  16:00  1.8  2.4  0.8  1.6  1.1  33.0  0  307
3   1/1/2019  17:00  3.0  3.2  0.6  2.6  6.0  32.8  1  310
4   1/1/2019  18:00  1.6  2.2  0.5  1.7  7.9  33.1  4  293
5   1/1/2019  19:00  1.7  1.1  0.6  0.6  5.9  35.0  5  262
6   1/1/2019  20:00  1.0  0.5  0.2  0.2  2.9  32.6  5  201
7   1/1/2019  21:00  0.6  0.3  0.0  0.4  2.1  31.8  6  182
8   1/1/2019  22:00  0.4  0.3  0.0  0.4  5.1  31.4  6  187
9   1/1/2019  23:00  0.8  0.6  0.3  0.3  9.9  30.2  5  227
10  1/1/2019  24:00  1.0  0.7  0.3  0.4  6.9  27.9  4  225
11  1/2/2019  01:00  1.3  0.9  0.5  0.4  4.0  26.9  6  236
12  1/2/2019  02:00  0.4  0.4  0.2  0.2  5.0  27.3  6  168
13  1/2/2019  03:00  0.7  0.5  0.3  0.3  6.9  30.2  4  219
14  1/2/2019  04:00  1.3  0.8  0.5  0.3  5.9  32.3  4  242
15  1/2/2019  05:00  0.7  0.2  0.0  0.2  3.0  33.8  4  177
16  1/2/2019  06:00  0.5  0.2  0.2  0.1  5.1  36.1  4  195
17  1/2/2019  07:00  0.6  0.3  0.2  0.2  9.9  38.0  4  200
18  1/2/2019  08:00  0.5  0.6  0.4  0.3  6.8  38.9  4  179
19  1/2/2019  09:00  0.5  0.2  0.0  0.2  3.0  39.0  4  193
20  1/2/2019  10:00  0.3  0.3  0.2  0.1  4.0  38.7  5  198
21  1/2/2019  11:00  0.3  0.3  0.2  0.0  4.9  38.4  5  170
22  1/2/2019  12:00  0.6  0.3  0.3  0.0  2.0  38.4  4  172
23  1/2/2019  13:00  0.2  0.3  0.2  0.0  2.0  38.8  4  154
24  1/2/2019  14:00  0.3  0.1  0.0  0.2  1.9  39.3  4  145

df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(df.pop('time') + ':00')
print (df)
                  date    a    b    c    d    e     f  g    h
0  2019-01-01 14:00:00  0.2  0.1  0.0  0.2  3.0  36.7  3  153
1  2019-01-01 15:00:00  0.2  0.6  0.2  0.4  3.9  36.7  1  199
2  2019-01-01 16:00:00  1.8  2.4  0.8  1.6  1.1  33.0  0  307
3  2019-01-01 17:00:00  3.0  3.2  0.6  2.6  6.0  32.8  1  310
4  2019-01-01 18:00:00  1.6  2.2  0.5  1.7  7.9  33.1  4  293
5  2019-01-01 19:00:00  1.7  1.1  0.6  0.6  5.9  35.0  5  262
6  2019-01-01 20:00:00  1.0  0.5  0.2  0.2  2.9  32.6  5  201
7  2019-01-01 21:00:00  0.6  0.3  0.0  0.4  2.1  31.8  6  182
8  2019-01-01 22:00:00  0.4  0.3  0.0  0.4  5.1  31.4  6  187
9  2019-01-01 23:00:00  0.8  0.6  0.3  0.3  9.9  30.2  5  227
10 2019-01-02 00:00:00  1.0  0.7  0.3  0.4  6.9  27.9  4  225
11 2019-01-02 01:00:00  1.3  0.9  0.5  0.4  4.0  26.9  6  236
12 2019-01-02 02:00:00  0.4  0.4  0.2  0.2  5.0  27.3  6  168
13 2019-01-02 03:00:00  0.7  0.5  0.3  0.3  6.9  30.2  4  219
14 2019-01-02 04:00:00  1.3  0.8  0.5  0.3  5.9  32.3  4  242
15 2019-01-02 05:00:00  0.7  0.2  0.0  0.2  3.0  33.8  4  177
16 2019-01-02 06:00:00  0.5  0.2  0.2  0.1  5.1  36.1  4  195
17 2019-01-02 07:00:00  0.6  0.3  0.2  0.2  9.9  38.0  4  200
18 2019-01-02 08:00:00  0.5  0.6  0.4  0.3  6.8  38.9  4  179
19 2019-01-02 09:00:00  0.5  0.2  0.0  0.2  3.0  39.0  4  193
20 2019-01-02 10:00:00  0.3  0.3  0.2  0.1  4.0  38.7  5  198
21 2019-01-02 11:00:00  0.3  0.3  0.2  0.0  4.9  38.4  5  170
22 2019-01-02 12:00:00  0.6  0.3  0.3  0.0  2.0  38.4  4  172
23 2019-01-02 13:00:00  0.2  0.3  0.2  0.0  2.0  38.8  4  154
24 2019-01-02 14:00:00  0.3  0.1  0.0  0.2  1.9  39.3  4  145
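For a self-contained test of the 24:00 handling, here is a minimal sketch; the file in the question has no header row, so the short column names below are made up for illustration:

import io
import pandas as pd

#three sample lines from the question; the real file is much larger
raw = ("1/1/2019,23:00,0.8,0.6,0.3,0.3,9.9,30.2,5,227\n"
       "1/1/2019,24:00,1.0,0.7,0.3,0.4,6.9,27.9,4,225\n"
       "1/2/2019,01:00,1.3,0.9,0.5,0.4,4.0,26.9,6,236\n")

names = ['date', 'time', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
df = pd.read_csv(io.StringIO(raw), header=None, names=names)

#'24:00' becomes a 24-hour timedelta, which rolls the date forward one day
df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(df.pop('time') + ':00')
print(df)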
Python3 Pandas - How to combine multiple rows to one
Python version: 3.6
Pandas version: 0.21.1

How do I get from

print(df_raw)
  device_id  temp_a  temp_b  temp_c
0         0     0.2     0.8     0.6
1         0     0.1     0.9     0.4
2         1     0.3     0.7     0.2
3         2     0.5     0.5     0.1
4         2     0.1     0.9     0.4
5         2     0.7     0.3     0.9

to

print(df_except2)
  device_id  temp_a  temp_b  temp_c  temp_a_1  temp_b_1  temp_c_1  temp_a_2  \
0         0     0.2     0.8     0.6       0.1       0.9       0.4       NaN
1         1     0.3     0.7     0.2       NaN       NaN       NaN       NaN
2         2     0.5     0.5     0.1       0.1       0.9       0.4       0.7

   temp_b_2  temp_c_2
0       NaN       NaN
1       NaN       NaN
2       0.3       0.9

Code for the data:

df_raw = pd.DataFrame({'device_id' : ['0','0','1','2','2','2'],
                       'temp_a' : [0.2,0.1,0.3,0.5,0.1,0.7],
                       'temp_b' : [0.8,0.9,0.7,0.5,0.9,0.3],
                       'temp_c' : [0.6,0.4,0.2,0.1,0.4,0.9],
                       })
print(df_raw)

df_except = pd.DataFrame({'device_id' : ['0','1','2'],
                          'temp_a':[0.2,0.3,0.5],
                          'temp_b':[0.8,0.7,0.5],
                          'temp_c':[0.6,0.2,0.1],
                          'temp_a_1':[0.1,None,0.1],
                          'temp_b_1':[0.9,None,0.9],
                          'temp_c_1':[0.4,None,0.4],
                          'temp_a_2':[None,None,0.7],
                          'temp_b_2':[None,None,0.3],
                          'temp_c_2':[None,None,0.9],
                          })
df_except2 = df_except[['device_id','temp_a','temp_b','temp_c','temp_a_1','temp_b_1','temp_c_1','temp_a_2','temp_b_2','temp_c_2']]
print(df_except2)

Note:
1. The number of rows per device_id is unknown.
2. I referred to the following answer: Pandas Dataframe - How to combine multiple rows to one. But that answer can only deal with one column.
Use:

g = df_raw.groupby('device_id').cumcount()
df = df_raw.set_index(['device_id', g]).unstack().sort_index(axis=1, level=1)
df.columns = ['{}_{}'.format(i,j) if j != 0 else '{}'.format(i) for i, j in df.columns]
df = df.reset_index()
print (df)
  device_id  temp_a  temp_b  temp_c  temp_a_1  temp_b_1  temp_c_1  temp_a_2  \
0         0     0.2     0.8     0.6       0.1       0.9       0.4       NaN
1         1     0.3     0.7     0.2       NaN       NaN       NaN       NaN
2         2     0.5     0.5     0.1       0.1       0.9       0.4       0.7

   temp_b_2  temp_c_2
0       NaN       NaN
1       NaN       NaN
2       0.3       0.9

Explanation:
1. First, count the rows per group with cumcount over column device_id.
2. Create a MultiIndex with set_index and the Series g.
3. Reshape with unstack.
4. Sort the second level of the column MultiIndex with sort_index.
5. Rename the columns with a list comprehension.
6. Finally, reset_index to turn the index back into a column.
Code:

import numpy as np

device_id_list = df_raw['device_id'].tolist()
device_id_list = list(np.unique(device_id_list))

append_df = pd.DataFrame()
for device_id in device_id_list:
    tmp_df = df_raw.query('device_id=="%s"'%(device_id))
    if len(tmp_df)>1:
        one_raw_list=[]
        for i in range(0,len(tmp_df)):
            one_raw_df = tmp_df.iloc[i:i+1]
            one_raw_list.append(one_raw_df)
        tmp_combine_df = pd.DataFrame()
        for i in range(0,len(one_raw_list)-1):
            next_raw = one_raw_list[i+1].drop(columns=['device_id']).reset_index(drop=True)
            new_name_list=[]
            for old_name in list(next_raw.columns):
                new_name_list.append(old_name+'_'+str(i+1))
            next_raw.columns = new_name_list
            if i==0:
                current_raw = one_raw_list[i].reset_index(drop=True)
                tmp_combine_df = pd.concat([current_raw, next_raw], axis=1)
            else:
                tmp_combine_df = pd.concat([tmp_combine_df, next_raw], axis=1)
        tmp_df = tmp_combine_df
    tmp_df_columns = tmp_df.columns
    append_df_columns = append_df.columns
    append_df = pd.concat([append_df,tmp_df], ignore_index=True)
    if len(tmp_df_columns) > len(append_df_columns):
        append_df = append_df[tmp_df_columns]
    else:
        append_df = append_df[append_df_columns]
print(append_df)

Output:

  device_id  temp_a  temp_b  temp_c  temp_a_1  temp_b_1  temp_c_1  temp_a_2  \
0         0     0.2     0.8     0.6       0.1       0.9       0.4       NaN
1         1     0.3     0.7     0.2       NaN       NaN       NaN       NaN
2         2     0.5     0.5     0.1       0.1       0.9       0.4       0.7

   temp_b_2  temp_c_2
0       NaN       NaN
1       NaN       NaN
2       0.3       0.9
df = pd.DataFrame({'device_id' : ['0','0','1','2','2','2'],
                   'temp_a' : [0.2,0.1,0.3,0.5,0.1,0.7],
                   'temp_b' : [0.8,0.9,0.7,0.5,0.9,0.3],
                   'temp_c' : [0.6,0.4,0.2,0.1,0.4,0.9],
                   })

cols_of_interest = df.columns.drop('device_id')
df["C"] = "C_" + (df.groupby("device_id").cumcount() + 1).astype(str)
df.pivot_table(index="device_id", values=cols_of_interest, columns="C")

Output:

          temp_a           temp_b           temp_c
C            C_1  C_2  C_3    C_1  C_2  C_3    C_1  C_2  C_3
device_id
0            0.2  0.1  NaN    0.8  0.9  NaN    0.6  0.4  NaN
1            0.3  NaN  NaN    0.7  NaN  NaN    0.2  NaN  NaN
2            0.5  0.1  0.7    0.5  0.9  0.3    0.1  0.4  0.9
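A possible follow-up (a sketch, going beyond the original answer): flatten the MultiIndex columns so the result matches the temp_a, temp_a_1, ... naming from the question:

out = df.pivot_table(index="device_id", values=cols_of_interest, columns="C")
#C_1 keeps the plain name, C_2 becomes _1, C_3 becomes _2, and so on
out.columns = [name if suffix == 'C_1' else '{}_{}'.format(name, int(suffix.split('_')[1]) - 1)
               for name, suffix in out.columns]
out = out.reset_index()
print(out)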