Update single column based on multiple 'priority' columns - python-3.x
Suppose you had a DataFrame with a number of columns / Series, say five for example. If the fifth column (named 'Updated Col') had values in addition to NaNs, what would be the best way to fill the NaNs in 'Updated Col' with values from the other columns, based on a preferred column order?
e.g. my DataFrame looks something like this:
Date             1    2    3    4    Updated Col
12/03/2017 0:00  0.4                 0.9
12/03/2017 0:10  0.4                 0.1
12/03/2017 0:20  0.4                 0.6
12/03/2017 0:30  0.9       0.7       Nan
12/03/2017 0:40  0.1                 Nan
12/03/2017 0:50  0.6            0.5  Nan
12/03/2017 1:00  0.4       0.3       Nan
12/03/2017 1:10  0.3            0.2  Nan
12/03/2017 1:20  0.9                 0.8
12/03/2017 1:30  0.9                 0.8
12/03/2017 1:40  0.0                 0.9
...and say, for example, I wanted the values from column 3 as a priority, followed by 2, then 1, I would expect the DataFrame to look like this:
Date             1    2    3    4    Updated Col
12/03/2017 0:00  0.4                 0.9
12/03/2017 0:10  0.4                 0.1
12/03/2017 0:20  0.4                 0.6
12/03/2017 0:30  0.9       0.7       0.7
12/03/2017 0:40  0.1                 0.1
12/03/2017 0:50  0.6            0.5  0.5
12/03/2017 1:00  0.4       0.3       0.3
12/03/2017 1:10  0.3            0.2  0.2
12/03/2017 1:20  0.9                 0.8
12/03/2017 1:30  0.9                 0.8
12/03/2017 1:40  0.0                 0.9
...values would be taken from the lower-priority columns only if the higher-priority columns were empty / NaN.
What would be the best way to do this?
I've tried numerous np.where attempts but can't work out what the best way would be.
Many thanks in advance.
You can use forward filling (ffill) along the columns (axis=1) and then select the last column:
updated_col = 'Updated Col'
#define columns to check, maybe [1,2,3,4] if the column names are integers
cols = ['1','2','3','4'] + [updated_col]
print (df[cols].ffill(axis=1))
      1    2    3    4  Updated Col
0   0.4  0.4  0.4  0.4          0.9
1   0.4  0.4  0.4  0.4          0.1
2   0.4  0.4  0.4  0.4          0.6
3   0.9  0.9  0.7  0.7          0.7
4   0.1  0.1  0.1  0.1          0.1
5   0.6  0.6  0.6  0.5          0.5
6   0.4  0.4  0.3  0.3          0.3
7   0.3  0.3  0.3  0.2          0.2
8   0.9  0.9  0.9  0.9          0.8
9   0.9  0.9  0.9  0.9          0.8
10  0.0  0.0  0.0  0.0          0.9
df[updated_col] = df[cols].ffill(axis=1)[updated_col]
print (df)
               Date    1    2    3    4  Updated Col
0   12/03/2017 0:00  0.4  NaN  NaN  NaN          0.9
1   12/03/2017 0:10  0.4  NaN  NaN  NaN          0.1
2   12/03/2017 0:20  0.4  NaN  NaN  NaN          0.6
3   12/03/2017 0:30  0.9  NaN  0.7  NaN          0.7
4   12/03/2017 0:40  0.1  NaN  NaN  NaN          0.1
5   12/03/2017 0:50  0.6  NaN  NaN  0.5          0.5
6   12/03/2017 1:00  0.4  NaN  0.3  NaN          0.3
7   12/03/2017 1:10  0.3  NaN  NaN  0.2          0.2
8   12/03/2017 1:20  0.9  NaN  NaN  NaN          0.8
9   12/03/2017 1:30  0.9  NaN  NaN  NaN          0.8
10  12/03/2017 1:40  0.0  NaN  NaN  NaN          0.9
EDIT:
Thanks to shivsn for the comment.
If the DataFrame contains 'Nan' strings (which are not real NaN missing values) or empty strings, it is necessary to replace them first:
import numpy as np

updated_col = 'Updated Col'
cols = ['1','2','3','4'] + [updated_col]
#replace 'Nan' strings and empty strings with real missing values
d = {'Nan':np.nan, '': np.nan}
df = df.replace(d)
df[updated_col] = df[cols].ffill(axis=1)[updated_col]
print (df)
               Date    1    2    3    4  Updated Col
0   12/03/2017 0:00  0.4  NaN  NaN  NaN          0.9
1   12/03/2017 0:10  0.4  NaN  NaN  NaN          0.1
2   12/03/2017 0:20  0.4  NaN  NaN  NaN          0.6
3   12/03/2017 0:30  0.9  NaN  0.7  NaN          0.7
4   12/03/2017 0:40  0.1  NaN  NaN  NaN          0.1
5   12/03/2017 0:50  0.6  NaN  NaN  0.5          0.5
6   12/03/2017 1:00  0.4  NaN  0.3  NaN          0.3
7   12/03/2017 1:10  0.3  NaN  NaN  0.2          0.2
8   12/03/2017 1:20  0.9  NaN  NaN  NaN          0.8
9   12/03/2017 1:30  0.9  NaN  NaN  NaN          0.8
10  12/03/2017 1:40  0.0  NaN  NaN  NaN          0.9
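Note: ffill(axis=1) makes the rightmost non-missing value win, so the order of cols encodes the priority. If you want an explicit priority order instead, such as column 3 first, then 2, then 1 as the question asks, a possible alternative (a sketch, not part of the original answer) is to chain fillna calls over the columns in priority order:

updated_col = 'Updated Col'
priority = ['3', '2', '1']  #highest priority first; add '4' if it should participate

out = df[updated_col].copy()
for c in priority:
    #fill the remaining gaps from the next column in the priority list
    out = out.fillna(df[c])
df[updated_col] = out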
Related
Comparing one month's value of the current year with the previous year's values, adding or subtracting multiple parameters
Given the following dataframe df:

    date        mom_pct
0   2020-1-31       1.4
1   2020-2-29       0.8
2   2020-3-31      -1.2
3   2020-4-30      -0.9
4   2020-5-31      -0.8
5   2020-6-30      -0.1
6   2020-7-31       0.6
7   2020-8-31       0.4
8   2020-9-30       0.2
9   2020-10-31     -0.3
10  2020-11-30     -0.6
11  2020-12-31      0.7
12  2021-1-31       1.0
13  2021-2-28       0.6
14  2021-3-31      -0.5
15  2021-4-30      -0.3
16  2021-5-31      -0.2
17  2021-6-30      -0.4
18  2021-7-31       0.3
19  2021-8-31       0.1
20  2021-9-30       0.0
21  2021-10-31      0.7
22  2021-11-30      0.4
23  2021-12-31     -0.3
24  2022-1-31       0.4
25  2022-2-28       0.6
26  2022-3-31       0.0
27  2022-4-30       0.4
28  2022-5-31      -0.2

I want to compare the chain ratio value of a month of the current year with the value of the same month of the previous year. Assume that the value of the same period last year is y_t-1 and the current value of this year is y_t. I will create a new column according to the following rules:

If y_t = y_t-1, return 0 for the new column;
If y_t ∈ (y_t-1, y_t-1 + 0.3], return 1;
If y_t ∈ (y_t-1 + 0.3, y_t-1 + 0.5], return 2;
If y_t > y_t-1 + 0.5, return 3;
If y_t ∈ [y_t-1 - 0.3, y_t-1), return -1;
If y_t ∈ [y_t-1 - 0.5, y_t-1 - 0.3), return -2;
If y_t < y_t-1 - 0.5, return -3.

The expected result:

    date        mom_pct  categorial_mom_pct
0   2020-1-31       1.0                 NaN
1   2020-2-29       0.8                 NaN
2   2020-3-31      -1.2                 NaN
3   2020-4-30      -0.9                 NaN
4   2020-5-31      -0.8                 NaN
5   2020-6-30      -0.1                 NaN
6   2020-7-31       0.6                 NaN
7   2020-8-31       0.4                 NaN
8   2020-9-30       0.2                 NaN
9   2020-10-31     -0.3                 NaN
10  2020-11-30     -0.6                 NaN
11  2020-12-31      0.7                 NaN
12  2021-1-31       1.0                 0.0
13  2021-2-28       0.6                -1.0
14  2021-3-31      -0.5                 3.0
15  2021-4-30      -0.3                 3.0
16  2021-5-31      -0.2                 3.0
17  2021-6-30      -0.4                -1.0
18  2021-7-31       0.3                -1.0
19  2021-8-31       0.1                -1.0
20  2021-9-30       0.0                -1.0
21  2021-10-31      0.7                 3.0
22  2021-11-30      0.4                 3.0
23  2021-12-31     -0.3                -3.0
24  2022-1-31       0.4                -3.0
25  2022-2-28       0.6                 0.0
26  2022-3-31       0.0                 2.0
27  2022-4-30       0.4                 3.0
28  2022-5-31      -0.2                 0.0

I attempted to create a column of shifted thresholds for each range and then check which range mom_pct falls into. Is it possible to do this in a more efficient way? Thanks.

df1['mom_pct_zero'] = df1['mom_pct'].shift(12)
df1['mom_pct_pos1'] = df1['mom_pct'].shift(12) + 0.3
df1['mom_pct_pos2'] = df1['mom_pct'].shift(12) + 0.5
df1['mom_pct_neg1'] = df1['mom_pct'].shift(12) - 0.3
df1['mom_pct_neg2'] = df1['mom_pct'].shift(12) - 0.5
I would do it as follows:

import numpy as np

def categorize(v):
    if np.isnan(v) or v == 0.:
        return v
    sign = -1 if v < 0 else 1
    eps = 1e-10
    if abs(v) <= 0.3 + eps:
        return sign * 1
    if abs(v) <= 0.5 + eps:
        return sign * 2
    return sign * 3

df['categorial_mom_pct'] = df['mom_pct'].diff(12).map(categorize)
print(df)

Note that I added a very small eps to the thresholds to counter the precision issue of floating point arithmetic:

abs(-0.3) <= 0.3                # True
abs(-0.4 + 0.1) <= 0.3          # False
abs(-0.4 + 0.1) <= 0.3 + 1e-10  # True

Out:

    date        mom_pct  categorial_mom_pct
0   2020-1-31       1.0                 NaN
1   2020-2-29       0.8                 NaN
2   2020-3-31      -1.2                 NaN
3   2020-4-30      -0.9                 NaN
4   2020-5-31      -0.8                 NaN
5   2020-6-30      -0.1                 NaN
6   2020-7-31       0.6                 NaN
7   2020-8-31       0.4                 NaN
8   2020-9-30       0.2                 NaN
9   2020-10-31     -0.3                 NaN
10  2020-11-30     -0.6                 NaN
11  2020-12-31      0.7                 NaN
12  2021-1-31       1.0                 0.0
13  2021-2-28       0.6                -1.0
14  2021-3-31      -0.5                 3.0
15  2021-4-30      -0.3                 3.0
16  2021-5-31      -0.2                 3.0
17  2021-6-30      -0.4                -1.0
18  2021-7-31       0.3                -1.0
19  2021-8-31       0.1                -1.0
20  2021-9-30       0.0                -1.0
21  2021-10-31      0.7                 3.0
22  2021-11-30      0.4                 3.0
23  2021-12-31     -0.3                -3.0
24  2022-1-31       0.4                -3.0
25  2022-2-28       0.6                 0.0
26  2022-3-31       0.0                 2.0
27  2022-4-30       0.4                 3.0
28  2022-5-31      -0.2                 0.0
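A possible vectorized alternative (a sketch, not from the original answer) expresses the same thresholds with np.select on the 12-period diff and restores the sign with np.sign; the NaNs in the first 12 rows survive because NaN propagates through the multiplication:

import numpy as np

d = df['mom_pct'].diff(12)
eps = 1e-10  #same floating point guard as above

#band 0 for an exact tie, then 1/2/3 for growing absolute differences
magnitude = np.select(
    [d == 0, d.abs() <= 0.3 + eps, d.abs() <= 0.5 + eps],
    [0, 1, 2],
    default=3,
)
df['categorial_mom_pct'] = np.sign(d) * magnitude  #NaN * anything stays NaN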
Creating multiple cohorts from a pivot table
I have a requirement like below. The initial information is a list of gross adds:

201910  201911  201912  202001  202002
 20000   30000   32000   40000   36000

I have a pivot table as below:

201910  201911  201912  202001  202002
  1000    2000    2400    3200    1800
   500     400     300     200     nan
   200     150     100     nan     nan
   200     100     nan     nan     nan
   160     nan     nan     nan     nan

I need to generate a report like below:

Cohort01: 5%  3%  3%  1%  1%  1%

From Cohort02 onwards, the missing value takes the average of the last value of Cohort01. Similarly, for Cohort03, both nan values take the average of the corresponding values of Cohort01 and Cohort02. Again, while calculating Cohort04, it takes the average of the previous two cohorts' values (Cohort02 and Cohort03) to fill all three nan values. Can anyone provide a solution for this in Python? The report should be generated as above, and all cohorts should be created separately.
You could try it like this:

res = df.apply(lambda x: round(100/(df_gross.iloc[0]/x),1), axis=1)
print(res)

   201910  201911  201912  202001  202002
0     5.0     6.7     7.5     8.0     5.0
1     2.5     1.3     0.9     0.5     NaN
2     1.0     0.5     0.3     NaN     NaN
3     1.0     0.3     NaN     NaN     NaN
4     0.8     NaN     NaN     NaN     NaN

for idx,col in enumerate(res.columns[1:],1):
    res[col] = res[col].fillna((res.iloc[:,max(idx-2,0)]+res.iloc[:,idx-1])/2)
print(res)

   201910  201911  201912  202001  202002
0     5.0     6.7    7.50   8.000  5.0000
1     2.5     1.3    0.90   0.500  0.7000
2     1.0     0.5    0.30   0.400  0.3500
3     1.0     0.3    0.65   0.475  0.5625
4     0.8     0.8    0.80   0.800  0.8000
Extract one column into multiple columns from a CSV file
My credit_scoring.csv looks like this. How can I reorganize it so that there are 14 columns, each holding its corresponding value?

  Seniority;Home;Time;Age;Marital;Records;Job;Expenses;Income;Assets;Debt;Amount;Price;Status
0 9.0;1.0;60.0;30.0;0.0;1.0;1.0;73.0;129.0;0.0;0...
1 17.0;1.0;60.0;58.0;1.0;1.0;0.0;48.0;131.0;0.0;...
2 10.0;0.0;36.0;46.0;0.0;2.0;1.0;90.0;200.0;3000...
3 0.0;1.0;60.0;24.0;1.0;1.0;0.0;63.0;182.0;2500....
4 0.0;1.0;36.0;26.0;1.0;1.0;0.0;46.0;107.0;0.0;0...
. .................................................
. .................................................
. .................................................
. .................................................
You can simply use read_csv() with sep=';'. Your example data isn't great, but I tried to make the most of it. I saved it as a.csv, and here is the code:

In [1]: import pandas as pd

In [2]: pd.read_csv('a.csv', sep=';')
Out[2]:
   Seniority  Home  Time   Age  Marital  Records  Job  Expenses  Income  Assets  Debt  Amount  Price  Status
0        9.0   1.0  60.0  30.0      0.0      1.0  1.0      73.0   129.0     0.0   0.0     NaN    NaN     NaN
1       17.0   1.0  60.0  58.0      1.0      1.0  0.0      48.0   131.0     0.0   NaN     NaN    NaN     NaN
2       10.0   0.0  36.0  46.0      0.0      2.0  1.0      90.0   200.0  3000.0   NaN     NaN    NaN     NaN
3        0.0   1.0  60.0  24.0      1.0      1.0  0.0      63.0   182.0  2500.0   NaN     NaN    NaN     NaN
4        0.0   1.0  36.0  26.0      1.0      1.0  0.0      46.0   107.0     0.0   0.0     NaN    NaN     NaN
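If the goal is a reorganized file on disk rather than just a DataFrame, a possible follow-up (the output filename below is made up for illustration) is to write the parsed table back out with a comma separator:

import pandas as pd

df = pd.read_csv('credit_scoring.csv', sep=';')
df.to_csv('credit_scoring_fixed.csv', index=False)  #hypothetical output name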
Issue with datetime formatting
I am having an issue with the datetime format of a set of data. The issue is that the hour of day ranges from 1-24, with the 24th hour assigned to the wrong day (more specifically, the previous day). A sample of the data is below:

1/1/2019,14:00,0.2,0.1,0.0,0.2,3.0,36.7,3,153
1/1/2019,15:00,0.2,0.6,0.2,0.4,3.9,36.7,1,199
1/1/2019,16:00,1.8,2.4,0.8,1.6,1.1,33.0,0,307
1/1/2019,17:00,3.0,3.2,0.6,2.6,6.0,32.8,1,310
1/1/2019,18:00,1.6,2.2,0.5,1.7,7.9,33.1,4,293
1/1/2019,19:00,1.7,1.1,0.6,0.6,5.9,35.0,5,262
1/1/2019,20:00,1.0,0.5,0.2,0.2,2.9,32.6,5,201
1/1/2019,21:00,0.6,0.3,0.0,0.4,2.1,31.8,6,182
1/1/2019,22:00,0.4,0.3,0.0,0.4,5.1,31.4,6,187
1/1/2019,23:00,0.8,0.6,0.3,0.3,9.9,30.2,5,227
1/1/2019,24:00,1.0,0.7,0.3,0.4,6.9,27.9,4,225   --- here the date should be 1/2/2019
1/2/2019,01:00,1.3,0.9,0.5,0.4,4.0,26.9,6,236
1/2/2019,02:00,0.4,0.4,0.2,0.2,5.0,27.3,6,168
1/2/2019,03:00,0.7,0.5,0.3,0.3,6.9,30.2,4,219
1/2/2019,04:00,1.3,0.8,0.5,0.3,5.9,32.3,4,242
1/2/2019,05:00,0.7,0.2,0.0,0.2,3.0,33.8,4,177
1/2/2019,06:00,0.5,0.2,0.2,0.1,5.1,36.1,4,195
1/2/2019,07:00,0.6,0.3,0.2,0.2,9.9,38.0,4,200
1/2/2019,08:00,0.5,0.6,0.4,0.3,6.8,38.9,4,179
1/2/2019,09:00,0.5,0.2,0.0,0.2,3.0,39.0,4,193
1/2/2019,10:00,0.3,0.3,0.2,0.1,4.0,38.7,5,198
1/2/2019,11:00,0.3,0.3,0.2,0.0,4.9,38.4,5,170
1/2/2019,12:00,0.6,0.3,0.3,0.0,2.0,38.4,4,172
1/2/2019,13:00,0.2,0.3,0.2,0.0,2.0,38.8,4,154
1/2/2019,14:00,0.3,0.1,0.0,0.2,1.9,39.3,4,145

This is a fairly large set of data which I need to make a time series plot of, so I need a way to fix this formatting issue. I was attempting to iterate through the rows of a pandas dataframe to fix the problematic rows, but this did not produce any results. Thank you for any help beforehand.
You can convert date to datetimes with to_datetime and then add the time column converted to timedeltas with to_timedelta:

df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(df['time'] + ':00')

Or, if you also need to remove the time column:

print (df)
        date   time    a    b    c    d    e     f  g    h
0   1/1/2019  14:00  0.2  0.1  0.0  0.2  3.0  36.7  3  153
1   1/1/2019  15:00  0.2  0.6  0.2  0.4  3.9  36.7  1  199
2   1/1/2019  16:00  1.8  2.4  0.8  1.6  1.1  33.0  0  307
3   1/1/2019  17:00  3.0  3.2  0.6  2.6  6.0  32.8  1  310
4   1/1/2019  18:00  1.6  2.2  0.5  1.7  7.9  33.1  4  293
5   1/1/2019  19:00  1.7  1.1  0.6  0.6  5.9  35.0  5  262
6   1/1/2019  20:00  1.0  0.5  0.2  0.2  2.9  32.6  5  201
7   1/1/2019  21:00  0.6  0.3  0.0  0.4  2.1  31.8  6  182
8   1/1/2019  22:00  0.4  0.3  0.0  0.4  5.1  31.4  6  187
9   1/1/2019  23:00  0.8  0.6  0.3  0.3  9.9  30.2  5  227
10  1/1/2019  24:00  1.0  0.7  0.3  0.4  6.9  27.9  4  225
11  1/2/2019  01:00  1.3  0.9  0.5  0.4  4.0  26.9  6  236
12  1/2/2019  02:00  0.4  0.4  0.2  0.2  5.0  27.3  6  168
13  1/2/2019  03:00  0.7  0.5  0.3  0.3  6.9  30.2  4  219
14  1/2/2019  04:00  1.3  0.8  0.5  0.3  5.9  32.3  4  242
15  1/2/2019  05:00  0.7  0.2  0.0  0.2  3.0  33.8  4  177
16  1/2/2019  06:00  0.5  0.2  0.2  0.1  5.1  36.1  4  195
17  1/2/2019  07:00  0.6  0.3  0.2  0.2  9.9  38.0  4  200
18  1/2/2019  08:00  0.5  0.6  0.4  0.3  6.8  38.9  4  179
19  1/2/2019  09:00  0.5  0.2  0.0  0.2  3.0  39.0  4  193
20  1/2/2019  10:00  0.3  0.3  0.2  0.1  4.0  38.7  5  198
21  1/2/2019  11:00  0.3  0.3  0.2  0.0  4.9  38.4  5  170
22  1/2/2019  12:00  0.6  0.3  0.3  0.0  2.0  38.4  4  172
23  1/2/2019  13:00  0.2  0.3  0.2  0.0  2.0  38.8  4  154
24  1/2/2019  14:00  0.3  0.1  0.0  0.2  1.9  39.3  4  145

df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(df.pop('time') + ':00')
print (df)
                  date    a    b    c    d    e     f  g    h
0  2019-01-01 14:00:00  0.2  0.1  0.0  0.2  3.0  36.7  3  153
1  2019-01-01 15:00:00  0.2  0.6  0.2  0.4  3.9  36.7  1  199
2  2019-01-01 16:00:00  1.8  2.4  0.8  1.6  1.1  33.0  0  307
3  2019-01-01 17:00:00  3.0  3.2  0.6  2.6  6.0  32.8  1  310
4  2019-01-01 18:00:00  1.6  2.2  0.5  1.7  7.9  33.1  4  293
5  2019-01-01 19:00:00  1.7  1.1  0.6  0.6  5.9  35.0  5  262
6  2019-01-01 20:00:00  1.0  0.5  0.2  0.2  2.9  32.6  5  201
7  2019-01-01 21:00:00  0.6  0.3  0.0  0.4  2.1  31.8  6  182
8  2019-01-01 22:00:00  0.4  0.3  0.0  0.4  5.1  31.4  6  187
9  2019-01-01 23:00:00  0.8  0.6  0.3  0.3  9.9  30.2  5  227
10 2019-01-02 00:00:00  1.0  0.7  0.3  0.4  6.9  27.9  4  225
11 2019-01-02 01:00:00  1.3  0.9  0.5  0.4  4.0  26.9  6  236
12 2019-01-02 02:00:00  0.4  0.4  0.2  0.2  5.0  27.3  6  168
13 2019-01-02 03:00:00  0.7  0.5  0.3  0.3  6.9  30.2  4  219
14 2019-01-02 04:00:00  1.3  0.8  0.5  0.3  5.9  32.3  4  242
15 2019-01-02 05:00:00  0.7  0.2  0.0  0.2  3.0  33.8  4  177
16 2019-01-02 06:00:00  0.5  0.2  0.2  0.1  5.1  36.1  4  195
17 2019-01-02 07:00:00  0.6  0.3  0.2  0.2  9.9  38.0  4  200
18 2019-01-02 08:00:00  0.5  0.6  0.4  0.3  6.8  38.9  4  179
19 2019-01-02 09:00:00  0.5  0.2  0.0  0.2  3.0  39.0  4  193
20 2019-01-02 10:00:00  0.3  0.3  0.2  0.1  4.0  38.7  5  198
21 2019-01-02 11:00:00  0.3  0.3  0.2  0.0  4.9  38.4  5  170
22 2019-01-02 12:00:00  0.6  0.3  0.3  0.0  2.0  38.4  4  172
23 2019-01-02 13:00:00  0.2  0.3  0.2  0.0  2.0  38.8  4  154
24 2019-01-02 14:00:00  0.3  0.1  0.0  0.2  1.9  39.3  4  145
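For a self-contained test of the 24:00 handling, here is a minimal sketch; the file in the question has no header row, so the short column names below are made up for illustration:

import io
import pandas as pd

#three sample lines from the question; the real file is much larger
raw = ("1/1/2019,23:00,0.8,0.6,0.3,0.3,9.9,30.2,5,227\n"
       "1/1/2019,24:00,1.0,0.7,0.3,0.4,6.9,27.9,4,225\n"
       "1/2/2019,01:00,1.3,0.9,0.5,0.4,4.0,26.9,6,236\n")

names = ['date', 'time', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
df = pd.read_csv(io.StringIO(raw), header=None, names=names)

#'24:00' becomes a 24-hour timedelta, which rolls the date forward one day
df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(df.pop('time') + ':00')
print(df)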
Python3 Pandas - How to combine multiple rows to one
Python version: 3.6
Pandas version: 0.21.1

How do I get from

print(df_raw)
  device_id  temp_a  temp_b  temp_c
0         0     0.2     0.8     0.6
1         0     0.1     0.9     0.4
2         1     0.3     0.7     0.2
3         2     0.5     0.5     0.1
4         2     0.1     0.9     0.4
5         2     0.7     0.3     0.9

to

print(df_except2)
  device_id  temp_a  temp_b  temp_c  temp_a_1  temp_b_1  temp_c_1  temp_a_2  \
0         0     0.2     0.8     0.6       0.1       0.9       0.4       NaN
1         1     0.3     0.7     0.2       NaN       NaN       NaN       NaN
2         2     0.5     0.5     0.1       0.1       0.9       0.4       0.7

   temp_b_2  temp_c_2
0       NaN       NaN
1       NaN       NaN
2       0.3       0.9

Code for the data:

df_raw = pd.DataFrame({'device_id' : ['0','0','1','2','2','2'],
                       'temp_a' : [0.2,0.1,0.3,0.5,0.1,0.7],
                       'temp_b' : [0.8,0.9,0.7,0.5,0.9,0.3],
                       'temp_c' : [0.6,0.4,0.2,0.1,0.4,0.9],
                       })
print(df_raw)

df_except = pd.DataFrame({'device_id' : ['0','1','2'],
                          'temp_a':[0.2,0.3,0.5],
                          'temp_b':[0.8,0.7,0.5],
                          'temp_c':[0.6,0.2,0.1],
                          'temp_a_1':[0.1,None,0.1],
                          'temp_b_1':[0.9,None,0.9],
                          'temp_c_1':[0.4,None,0.4],
                          'temp_a_2':[None,None,0.7],
                          'temp_b_2':[None,None,0.3],
                          'temp_c_2':[None,None,0.9],
                          })
df_except2 = df_except[['device_id','temp_a','temp_b','temp_c','temp_a_1','temp_b_1','temp_c_1','temp_a_2','temp_b_2','temp_c_2']]
print(df_except2)

Note:
1. The number of rows per device_id is unknown.
2. I referred to the following answer: Pandas Dataframe - How to combine multiple rows to one. But that answer can only deal with one column.
Use:

g = df_raw.groupby('device_id').cumcount()
df = df_raw.set_index(['device_id', g]).unstack().sort_index(axis=1, level=1)
df.columns = ['{}_{}'.format(i,j) if j != 0 else '{}'.format(i) for i, j in df.columns]
df = df.reset_index()
print (df)
  device_id  temp_a  temp_b  temp_c  temp_a_1  temp_b_1  temp_c_1  temp_a_2  \
0         0     0.2     0.8     0.6       0.1       0.9       0.4       NaN
1         1     0.3     0.7     0.2       NaN       NaN       NaN       NaN
2         2     0.5     0.5     0.1       0.1       0.9       0.4       0.7

   temp_b_2  temp_c_2
0       NaN       NaN
1       NaN       NaN
2       0.3       0.9

Explanation:
1. First, count the rows per group with cumcount over column device_id.
2. Create a MultiIndex with set_index and the Series g.
3. Reshape with unstack.
4. Sort the second level of the column MultiIndex with sort_index.
5. Rename the columns with a list comprehension.
6. Finally, reset_index to turn the index back into a column.
Code:

import numpy as np

device_id_list = df_raw['device_id'].tolist()
device_id_list = list(np.unique(device_id_list))

append_df = pd.DataFrame()
for device_id in device_id_list:
    tmp_df = df_raw.query('device_id=="%s"'%(device_id))
    if len(tmp_df)>1:
        one_raw_list=[]
        for i in range(0,len(tmp_df)):
            one_raw_df = tmp_df.iloc[i:i+1]
            one_raw_list.append(one_raw_df)
        tmp_combine_df = pd.DataFrame()
        for i in range(0,len(one_raw_list)-1):
            next_raw = one_raw_list[i+1].drop(columns=['device_id']).reset_index(drop=True)
            new_name_list=[]
            for old_name in list(next_raw.columns):
                new_name_list.append(old_name+'_'+str(i+1))
            next_raw.columns = new_name_list
            if i==0:
                current_raw = one_raw_list[i].reset_index(drop=True)
                tmp_combine_df = pd.concat([current_raw, next_raw], axis=1)
            else:
                tmp_combine_df = pd.concat([tmp_combine_df, next_raw], axis=1)
        tmp_df = tmp_combine_df
    tmp_df_columns = tmp_df.columns
    append_df_columns = append_df.columns
    append_df = pd.concat([append_df,tmp_df], ignore_index=True)
    if len(tmp_df_columns) > len(append_df_columns):
        append_df = append_df[tmp_df_columns]
    else:
        append_df = append_df[append_df_columns]
print(append_df)

Output:

  device_id  temp_a  temp_b  temp_c  temp_a_1  temp_b_1  temp_c_1  temp_a_2  \
0         0     0.2     0.8     0.6       0.1       0.9       0.4       NaN
1         1     0.3     0.7     0.2       NaN       NaN       NaN       NaN
2         2     0.5     0.5     0.1       0.1       0.9       0.4       0.7

   temp_b_2  temp_c_2
0       NaN       NaN
1       NaN       NaN
2       0.3       0.9
df = pd.DataFrame({'device_id' : ['0','0','1','2','2','2'],
                   'temp_a' : [0.2,0.1,0.3,0.5,0.1,0.7],
                   'temp_b' : [0.8,0.9,0.7,0.5,0.9,0.3],
                   'temp_c' : [0.6,0.4,0.2,0.1,0.4,0.9],
                   })

cols_of_interest = df.columns.drop('device_id')
df["C"] = "C_" + (df.groupby("device_id").cumcount() + 1).astype(str)
df.pivot_table(index="device_id", values=cols_of_interest, columns="C")

Output:

          temp_a           temp_b           temp_c
C            C_1  C_2  C_3    C_1  C_2  C_3    C_1  C_2  C_3
device_id
0            0.2  0.1  NaN    0.8  0.9  NaN    0.6  0.4  NaN
1            0.3  NaN  NaN    0.7  NaN  NaN    0.2  NaN  NaN
2            0.5  0.1  0.7    0.5  0.9  0.3    0.1  0.4  0.9
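A possible follow-up (a sketch, going beyond the original answer): flatten the MultiIndex columns so the result matches the temp_a, temp_a_1, ... naming from the question:

out = df.pivot_table(index="device_id", values=cols_of_interest, columns="C")
#C_1 keeps the plain name, C_2 becomes _1, C_3 becomes _2, and so on
out.columns = [name if suffix == 'C_1' else '{}_{}'.format(name, int(suffix.split('_')[1]) - 1)
               for name, suffix in out.columns]
out = out.reset_index()
print(out)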