Tidying dataframe with one parameter per row - python-3.x

I have a dataframe like this:
import pandas as pd

d = {'Date': ['2020-10-09', '2020-10-09', '2020-10-09', '2020-10-10', '2020-10-10', '2020-10-10', '2020-10-11', '2020-10-11', '2020-10-11'],
     'ID': ['T1', 'T2', 'T3', 'T1', 'T2', 'T3', 'T1', 'T2', 'T3'],
     'Value': [13, 12, 11, 14, 15, 16, 20, 21, 22]}
df = pd.DataFrame(data=d)
df
Date ID Value
0 2020-10-09 T1 13
1 2020-10-09 T2 12
2 2020-10-09 T3 11
3 2020-10-10 T1 14
4 2020-10-10 T2 15
5 2020-10-10 T3 16
6 2020-10-11 T1 20
7 2020-10-11 T2 21
8 2020-10-11 T3 22
And I'm trying to get:
d = {'Date': ['2020-10-09', '2020-10-10', '2020-10-11'],
     'Value T1': [13, 14, 20],
     'Value T2': [12, 15, 21],
     'Value T3': [11, 16, 22]}
df = pd.DataFrame(data=d)
df
Date Value T1 Value T2 Value T3
0 2020-10-09 13 12 11
1 2020-10-10 14 15 16
2 2020-10-11 20 21 22
I tried with pivot but I got the error:
"Index contains duplicate entries, cannot reshape"

Use pd.pivot_table as shown below:
pdf = pd.pivot_table(
    df,
    values=['Value'],
    index=['Date'],
    columns=['ID'],
    aggfunc='first'
).reset_index(drop=False)
pdf.columns = ['Date', 'Value T1', 'Value T2', 'Value T3']
Date Value T1 Value T2 Value T3
0 2020-10-09 13 12 11
1 2020-10-10 14 15 16
2 2020-10-11 20 21 22
Note that aggfunc is 'first' here, which means that if there are multiple values for a given ID on a given Date, you'll get the first value in the dataframe. You can change it to min/max/last as needed.
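For reference, plain pivot also works on this exact sample, since each (Date, ID) pair occurs only once; the duplicate-entries error appears only when a pair repeats. A minimal sketch:
# works only while every (Date, ID) pair is unique
pdf = df.pivot(index='Date', columns='ID', values='Value').reset_index()
pdf.columns = ['Date', 'Value T1', 'Value T2', 'Value T3']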

Related

Range mapping in Python

MAPPER DATAFRAME
col_data = {'p0_tsize_qbin_': [1, 2, 3, 4, 5],
            'p0_tsize_min': [0.0, 7.0499999999999545, 16.149999999999977, 32.65000000000009, 76.79999999999973],
            'p0_tsize_max': [7.0, 16.100000000000023, 32.64999999999998, 76.75, 6759.850000000006]}
map_df = pd.DataFrame(col_data, columns=['p0_tsize_qbin_', 'p0_tsize_min', 'p0_tsize_max'])
map_df
The dataframe above is map_df, where the second and third columns (p0_tsize_min, p0_tsize_max) form a range and the first column (p0_tsize_qbin_) is the value to map into the new dataframe.
MAIN DATAFRAME
raw_data = {
    'id': ['1', '2', '2', '3', '3', '1', '2', '2', '3', '3', '1', '2', '2', '3', '3'],
    'val': [3, 56, 78, 11, 5000, 37, 756, 78, 49, 21, 9, 4, 14, 75, 31]}
df = pd.DataFrame(raw_data, columns=['id', 'val', 'p0_tsize_qbin_mapped'])
df
EXPECTED OUTPUT
For each val in df, look up the map_df row whose range [p0_tsize_min, p0_tsize_max] contains it, and take that row's p0_tsize_qbin_ value.
For example: val = 3 from df lies in the p0_tsize_min/p0_tsize_max range where p0_tsize_qbin_ == 1, so 1 is returned.
Try using pd.cut():
bins = map_df['p0_tsize_min'].tolist() + [map_df['p0_tsize_max'].max()]
labels = map_df['p0_tsize_qbin_'].tolist()
df.assign(p0_tsize_qbin_mapped=pd.cut(df['val'], bins=bins, labels=labels))
or
bins = pd.IntervalIndex.from_arrays(map_df['p0_tsize_min'], map_df['p0_tsize_max'])
map_df.loc[bins.get_indexer(df['val'].tolist()), 'p0_tsize_qbin_'].to_numpy()
Output:
id val p0_tsize_qbin_mapped
0 1 3 1
1 2 56 4
2 2 78 5
3 3 11 2
4 3 5000 5
5 1 37 4
6 2 756 5
7 2 78 5
8 3 49 4
9 3 21 3
10 1 9 2
11 2 4 1
12 2 14 2
13 3 75 4
14 3 31 3
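To keep the result, a small sketch (assuming map_df's ranges are sorted and non-overlapping, as in the sample) writing the mapped labels back into df:
# map each val to its interval's position, then pull the matching qbin label
bins = pd.IntervalIndex.from_arrays(map_df['p0_tsize_min'], map_df['p0_tsize_max'])
df['p0_tsize_qbin_mapped'] = map_df.loc[bins.get_indexer(df['val']), 'p0_tsize_qbin_'].to_numpy()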

How to resolve the following KeyError in my code while manipulating a dataframe?

Let's say I have this dataframe. I want to match a date value and check whether the employee id at the current index and the next index are the same; if they are, I want to delete the row at the next index. I tried the code below but got KeyError: 1.
EMPL_ID  AGE  EMPLOYER  END_DATE
12       23   BHU       2022-04-22 00:00:00
12       21   BHU       2022-04-22 00:00:00
34       22   DU        2022-04-22 00:00:00
36       21   BHU       2022-04-22 00:00:00
for index, row in df.iterrows():
    value = row['END_DATE']
    if (value == '2022-04-22 00:00:00'):
        a = index
        if (df.loc[a, 'EMPL_ID'] == df.loc[a+1, 'EMPL_ID']):
            df.drop(a+1, inplace=True)
        else:
            df = df
    else:
        df = df
Keeping only the unique rows based on two columns can be done with drop_duplicates, supplying the columns that make up the subset:
df = pd.DataFrame(data={'id': [1, 2, 3, 3], 'time': [1, 2, 3, 3]})
>>> df
id time
0 1 1
1 2 2
2 3 3
3 3 3
df.drop_duplicates(subset=['id', 'time'], inplace=True)
>>> df
id time
0 1 1
1 2 2
2 3 3
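Note that drop_duplicates removes every later repeat of an (id, time) pair, not only consecutive ones; a quick sketch of the difference:
df2 = pd.DataFrame({'id': [3, 1, 3], 'time': [3, 1, 3]})
df2.drop_duplicates(subset=['id', 'time'])  # row 2 is dropped even though row 1 separates the duplicates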
IIUC here is one approach. Since you only search for consecutive duplicates, I added another row to your example: the second row with EMPL_ID 34 won't be deleted.
df = pd.DataFrame({'EMPL_ID': [12, 12, 34, 36, 34],
                   'AGE': [23, 21, 22, 21, 22],
                   'EMPLOYER': ['BHU', 'BHU', 'DU', 'BHU', 'DU'],
                   'END_DATE': ['2022-04-22 00:00:00', '2022-04-22 00:00:00', '2022-04-22 00:00:00', '2022-04-22 00:00:00', '2022-04-22 00:00:00'],
                   })
print(df)
EMPL_ID AGE EMPLOYER END_DATE
0 12 23 BHU 2022-04-22 00:00:00
1 12 21 BHU 2022-04-22 00:00:00
2 34 22 DU 2022-04-22 00:00:00
3 36 21 BHU 2022-04-22 00:00:00
4 34 22 DU 2022-04-22 00:00:00
grp = ['END_DATE', 'EMPL_ID']
df['group'] = (df[grp] != df[grp].shift(1)).any(axis=1).cumsum()
df = df.drop_duplicates('group', keep='first').drop('group', axis=1)
print(df)
EMPL_ID AGE EMPLOYER END_DATE
0 12 23 BHU 2022-04-22 00:00:00
2 34 22 DU 2022-04-22 00:00:00
3 36 21 BHU 2022-04-22 00:00:00
4 34 22 DU 2022-04-22 00:00:00
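An equivalent sketch without the helper column, keeping a row only when its (END_DATE, EMPL_ID) pair differs from the previous row's:
grp = ['END_DATE', 'EMPL_ID']
mask = (df[grp] != df[grp].shift(1)).any(axis=1)  # True whenever the pair changes
df = df[mask]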

Get sum of group subset using pandas groupby

I have a dataframe as shown. Using Python, I want to get the sum of 'Value' for each 'Id' group up to the first occurrence of 'Stage' 12.
df = pd.DataFrame({'Id': [1, 1, 1, 2, 2, 2, 2],
                   'Date': ['2020-04-23', '2020-04-25', '2020-04-28', '2020-04-20', '2020-05-01', '2020-05-05', '2020-05-12'],
                   'Stage': [11, 12, 15, 11, 14, 12, 12],
                   'Value': [5, 4, 6, 12, 2, 8, 3]})
Id Date Stage Value
1 2020-04-23 11 5
1 2020-04-25 12 4
1 2020-04-28 15 6
2 2020-04-20 11 12
2 2020-05-01 14 2
2 2020-05-05 12 8
2 2020-05-12 12 3
My desired output:
Id Value
1 9
2 22
Would be very thankful if someone could help.
Let us try using groupby + transform with idxmax to filter the dataframe, then do another round of groupby:
idx = df['Stage'].eq(12).groupby(df['Id']).transform('idxmax')
output = df[df.index <= idx].groupby('Id')['Value'].sum().reset_index()
Detail
transform with idxmax returns, for every row in a group, the index of the first match with 12; we then filter df to the rows at or before that index, i.e. the data up to where the first 12 shows up.
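An alternative sketch with a cumulative mask instead of idxmax (assuming, as in the sample, that each Id reaches Stage 12):
# mask is True up to and including the first Stage == 12 within each Id
hit = df['Stage'].eq(12)
mask = hit.groupby(df['Id']).cumsum().groupby(df['Id']).shift(fill_value=0).eq(0)
output = df[mask].groupby('Id')['Value'].sum().reset_index()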

pandas: move to corresponding column based on value of other column

I'm trying to move the f1_am, f2_am, f3_am values to the corresponding column based on the values of f1_ty, f2_ty, f3_ty.
I started adding new columns to the dataframe based on the unique values of the _ty columns using sets, but I'm trying to figure out how to move the _am values to where they belong.
I looked at groupby and pivot, but the result exploded my mind...
I would appreciate some guidance. Below is the code.
import pandas as pd
import numpy as np
data = {
    'mem_id': ['A', 'B', 'C', 'A', 'B', 'C'],
    'date_inf': ['01/01/2019', '01/01/2019', '01/01/2019', '02/01/2019', '02/01/2019', '02/01/2019'],
    'f1_ty': ['ABC', 'ABC', 'ABC', 'ABC', 'GHI', 'GHI'],
    'f1_am': [100, 20, 57, 44, 15, 10],
    'f2_ty': ['DEF', 'DEF', 'DEF', 'GHI', 'ABC', 'XYZ'],
    'f2_am': [20, 30, 45, 66, 14, 21],
    'f3_ty': ['XYZ', 'GHI', 'OPQ', 'OPQ', 'XYZ', 'DEF'],
    'f3_am': [20, 30, 45, 66, 14, 21]
}
df = pd.DataFrame(data)
# distinct values in the _ty columns using sets
distinct_values = sorted(set(df['f1_ty']) | set(df['f2_ty']) | set(df['f3_ty']))
# add distinct values as new columns in the DataFrame
new_df = df.reindex(columns=np.append(df.columns.values, distinct_values))
So this would be my starting point and my wanted result.
Here is a try, thanks for the interesting problem: rename the columns to make them compatible with wide_to_long(), then unstack() while dropping the extra levels:
m = df.set_index(['mem_id', 'date_inf']).rename(columns=lambda x: ''.join(x.split('_')[::-1]))
n = (pd.wide_to_long(m.reset_index(), ['tyf', 'amf'], ['mem_id', 'date_inf'], 'v')
       .droplevel(-1)
       .set_index('tyf', append=True)
       .unstack(fill_value=0)
       .reindex(m.index))
final = n.droplevel(0, axis=1).rename_axis(None, axis=1).reset_index()
print(final)
mem_id date_inf ABC DEF GHI OPQ XYZ
0 A 01/01/2019 100 20 0 0 20
1 B 01/01/2019 20 30 30 0 0
2 C 01/01/2019 57 45 0 45 0
3 A 02/01/2019 44 0 66 66 0
4 B 02/01/2019 14 0 15 0 14
5 C 02/01/2019 0 21 10 0 21
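Another hedged sketch of the same reshape, stacking the ty/am column pairs with pd.lreshape and then pivoting (variable names here are illustrative; row order may differ):
# melt the three ty/am column pairs into two long columns
long = pd.lreshape(df, {'ty': ['f1_ty', 'f2_ty', 'f3_ty'],
                        'am': ['f1_am', 'f2_am', 'f3_am']})
out = (long.pivot_table(index=['mem_id', 'date_inf'], columns='ty',
                        values='am', aggfunc='sum', fill_value=0)
           .reset_index().rename_axis(None, axis=1))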

Python - sum of each day in multiple Excel sheets

I have Excel sheet data like below.
Sheet1
duration date
10 5/20/2017 08:20
23 5/20/2017 10:20
33 5/21/2017 12:20
56 5/22/2017 23:20
Sheet2
duration date
34 5/20/2017 01:20
12 5/20/2017 03:20
05 5/21/2017 11:20
44 5/22/2017 23:20
Expected output:
day[20] : [33, 46]
day[21] : [33, 5]
day[22] : [56, 44]
I am trying to sum duration day-wise across all sheets with the code below:
xls = pd.ExcelFile('reports.xlsx')
report_sheets = []
for sheetName in xls.sheet_names:
    sheet = pd.read_excel(xls, sheet_name=sheetName)
    sheet['date'] = pd.to_datetime(sheet['date'])
    print(sheet.groupby(sheet['date'].dt.strftime('%Y-%m-%d'))['duration'].sum().sort_values())
How can I achieve this?
You can use parameter sheet_name=None in read_excel to return a dictionary of DataFrames:
dfs = pd.read_excel('reports.xlsx', sheet_name=None)
print (dfs)
OrderedDict([('Sheet1', duration date
0 10 5/20/2017 08:20
1 23 5/20/2017 10:20
2 33 5/21/2017 12:20
3 56 5/22/2017 23:20), ('Sheet2', duration date
0 34 5/20/2017 01:20
1 12 5/20/2017 03:20
2 5 5/21/2017 11:20
3 44 5/22/2017 23:20)])
Then aggregate in a dictionary comprehension:
dfs1 = {i:x.groupby(pd.to_datetime(x['date']).dt.strftime('%Y-%m-%d'))['duration'].sum() for i, x in dfs.items()}
print (dfs1)
{'Sheet2': date
2017-05-20 46
2017-05-21 5
2017-05-22 44
Name: duration, dtype: int64, 'Sheet1': date
2017-05-20 33
2017-05-21 33
2017-05-22 56
Name: duration, dtype: int64}
And finally concat, create the lists, and build the dictionary with to_dict:
d = pd.concat(dfs1).groupby(level=1).apply(list).to_dict()
print (d)
{'2017-05-22': [56, 44], '2017-05-21': [33, 5], '2017-05-20': [33, 46]}
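For comparison, a sketch of the same result in one chain via concat, without the intermediate dictionaries (assuming every sheet has date and duration columns):
dfs = pd.read_excel('reports.xlsx', sheet_name=None, parse_dates=['date'])
d = (pd.concat(dfs, names=['sheet'])
       .assign(day=lambda x: x['date'].dt.strftime('%Y-%m-%d'))
       .groupby(['sheet', 'day'])['duration'].sum()
       .groupby(level='day').apply(list).to_dict())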
Make a function that takes the sheet's dataframe and returns a dictionary
def make_goofy_dict(d):
    d = d.set_index('date').duration.resample('D').sum()
    return d.apply(lambda x: [x]).to_dict()
Then use merge_with from either toolz or cytoolz
from cytoolz.dicttoolz import merge_with
merge_with(lambda x: sum(x, []), map(make_goofy_dict, (sheet1, sheet2)))
{Timestamp('2017-05-20 00:00:00', freq='D'): [33, 46],
Timestamp('2017-05-21 00:00:00', freq='D'): [33, 5],
Timestamp('2017-05-22 00:00:00', freq='D'): [56, 44]}
Details
print(sheet1, sheet2, sep='\n\n')
duration date
0 10 2017-05-20 08:20:00
1 23 2017-05-20 10:20:00
2 33 2017-05-21 12:20:00
3 56 2017-05-22 23:20:00
duration date
0 34 2017-05-20 01:20:00
1 12 2017-05-20 03:20:00
2 5 2017-05-21 11:20:00
3 44 2017-05-22 23:20:00
For your problem
I'd do this
from cytoolz.dicttoolz import merge_with

def make_goofy_dict(d):
    d = d.set_index('date').duration.resample('D').sum()
    return d.apply(lambda x: [x]).to_dict()

def read_sheet(xls, sn):
    return pd.read_excel(xls, sheet_name=sn, parse_dates=['date'])

xls = pd.ExcelFile('reports.xlsx')
sheet_dict = merge_with(
    lambda x: sum(x, []),
    map(make_goofy_dict, map(read_sheet, xls.sheet_names))
)
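If toolz/cytoolz isn't available, the same merge can be done with a plain dictionary loop (a minimal sketch reusing the two helpers above):
from collections import defaultdict

sheet_dict = defaultdict(list)
for sn in xls.sheet_names:
    for day, values in make_goofy_dict(read_sheet(xls, sn)).items():
        sheet_dict[day].extend(values)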
