I tried to sum up multiple rows, excluding the Hour and Date columns, but I get the error:
"ValueError: cannot join with no overlapping index names"
Exact data
Hour Input Date Total DA DB CA CB X Y Z Z1 Z2
0 A 9/23/2021 14570 6816 636 6821 297 14213 335 9 13 0
0 B 9/23/2021 147864 63746 10186 63746 10186 147821 0 42 1 0
1 A 9/23/2021 126681 63180 191 63178 132 126606 34 36 5 0
1 B 9/23/2021 33119 1 16558 1 16559 33106 0 13 0 0
2 A 9/23/2021 11550 5398 653 5395 104 10991 549 2 8 0
2 B 9/23/2021 25197 0 12599 0 12598 25176 0 21 0 0
3 A 9/23/2021 259 0 157 0 102 204 55 0 0 0
3 B 9/23/2021 14379 0 7189 0 7190 14347 0 32 0 0
Required output
Hour Input Date Total DA DB CA CB X Y Z Z1 Z2
0 A 9/23/2021 162434 70562 10822 70567 10483 162034 335 51 14 0
1 A 9/23/2021 159800 63181 16749 63179 16691 159712 34 49 5 0
2 A 9/23/2021 36747 5398 13252 5395 12702 36167 549 23 8 0
3 A 9/23/2021 14638 0 7346 0 7292 14551 55 32 0 0
I used the following script:
column_list = list(df_output)
column_list.remove('Hour')
df_output = df[df_output].sum(axis=1)
IIUC use:
df_output = df.groupby(['Hour','Date'], as_index=False).sum()
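If the Input label should also be kept in the result (as in the required output above), one option is a per-column aggregation dict; this is only a sketch, assuming df holds the sample data and the column names from the question:
import pandas as pd

# sum every value column per (Hour, Date) group and keep the first Input label
agg = {c: 'sum' for c in df.columns if c not in ('Hour', 'Date', 'Input')}
agg['Input'] = 'first'
df_output = df.groupby(['Hour', 'Date'], as_index=False).agg(agg)
print(df_output)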
I am a Python beginner.
I have the following pandas DataFrame, with only two columns: "Time" and "Input".
I want to loop over the "Input" column with a window size w = 3 (three consecutive values): for every window, if all the items within it are 1's, keep the first item as 1 and change the remaining values to 0's.
index Time Input
0 11 0
1 22 0
2 33 0
3 44 1
4 55 1
5 66 1
6 77 0
7 88 0
8 99 0
9 1010 0
10 1111 1
11 1212 1
12 1313 1
13 1414 0
14 1515 0
My intended output is as follows
index Time Input What_I_got What_I_Want
0 11 0 0 0
1 22 0 0 0
2 33 0 0 0
3 44 1 1 1
4 55 1 1 0
5 66 1 1 0
6 77 1 1 1
7 88 1 0 0
8 99 1 0 0
9 1010 0 0 0
10 1111 1 1 1
11 1212 1 0 0
12 1313 1 0 0
13 1414 0 0 0
14 1515 0 0 0
What should I do to get the desired output? Am I missing something in my code?
import pandas as pd
import re

# join Input into one string of 0/1 characters, replace each non-overlapping
# '111' with '100' left to right, then split back into an integer Series
pd.Series(list(re.sub('111', '100', ''.join(df.Input.astype(str))))).astype(int)
Out[23]:
0 0
1 0
2 0
3 1
4 0
5 0
6 1
7 0
8 0
9 0
10 1
11 0
12 0
13 0
14 0
dtype: int32
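The same left-to-right, non-overlapping window rule can also be written as an explicit loop, which may be easier to adapt for other window sizes. This is just a sketch, assuming the column is named Input; the helper name is made up here:
import pandas as pd

def collapse_runs(s, w=3):
    # keep the first value of each full window of w consecutive 1's,
    # zero out the rest of that window, then continue scanning after it
    out = s.copy()
    i = 0
    while i <= len(s) - w:
        if (s.iloc[i:i + w] == 1).all():
            out.iloc[i + 1:i + w] = 0
            i += w
        else:
            i += 1
    return out

df['Output'] = collapse_runs(df['Input'], w=3)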
I have a df as shown below
df:
ID Limit N_30 N_31_90 N_91_180 N_180_365
1 500 60 15 30 1
2 300 0 15 5 10
3 800 0 0 10 6
4 100 0 0 0 370
5 600 0 6 5 10
6 800 0 0 15 6
7 500 10 10 30 9
8 200 0 0 0 0
About the data
ID - customer ID
Limit - Limit
N_30 - Number of transactions in the last 30 days.
N_31_90 - Number of transactions in the last 31 to 90 days.
N_91_180 - Number of transactions in the last 91 to 180 days.
N_180_365 - Number of transactions in the last 181 to 365 days.
From the above df I would like to extract a column called Recency.
Explanation:
if df['N_30'] != 0, then Recency = (30/df['N_30'])
elif df['N_31_90'] != 0 then Recency = 30 + (60/df['N_31_90'])
elif df['N_91_180'] != 0 then Recency = 90 + (90/df['N_91_180'])
elif df['N_180_365'] != 0 then Recency = 180 + (185/df['N_180_365'])
else Recency = 730
Expected output:
ID Limit N_30 N_31_90 N_91_180 N_180_365 Recency
1 500 60 15 30 1 (30/60) = 0.5
2 300 0 15 5 10 30+(60/15) = 34
3 800 0 0 10 6 90+(90/10) = 99
4 100 0 0 0 370 180+(185/370) = 180.5
5 600 0 6 5 10 30+(60/6) = 40
6 800 0 0 15 6 90+(90/15) = 96
7 500 10 10 30 9 30/10 = 3
8 200 0 0 0 0 730
IIUC, using boolean masking with bfill:
# treat inf (from dividing by 0) as NaN so bfill skips those cells
pd.set_option("use_inf_as_na", True)
df2 = df.filter(like="N_")
# add the offset for every zero bucket, then the days/count of the first non-zero bucket
df["Recency"] = (df2.eq(0) * [30, 60, 90, 180]).sum(axis=1) + ([30, 60, 90, 185] / df2).bfill(axis=1).iloc[:, 0]
print(df)
Output:
ID Limit N_30 N_31_90 N_91_180 N_180_365 Recency
0 1 500 60 15 30 1 0.5
1 2 300 0 15 5 10 34.0
2 3 800 0 0 10 6 99.0
3 4 100 0 0 0 370 180.5
4 5 600 0 6 5 10 40.0
5 6 800 0 0 15 6 96.0
6 7 500 10 10 30 9 3.0
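As an alternative to the masking approach above, the if/elif chain from the question maps directly onto numpy.select; a sketch, assuming the column names from the sample frame:
import numpy as np

conditions = [df['N_30'] != 0, df['N_31_90'] != 0,
              df['N_91_180'] != 0, df['N_180_365'] != 0]
choices = [30 / df['N_30'],
           30 + 60 / df['N_31_90'],
           90 + 90 / df['N_91_180'],
           180 + 185 / df['N_180_365']]
# every branch is evaluated for every row, so unused branches may divide by zero;
# np.select ignores those values and takes the first true condition per row
df['Recency'] = np.select(conditions, choices, default=730)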
I have a pandas data frame dat as below:
0 1 0 1 0 1
0 A 23 0.1 122 56 9
1 B 24 0.45 564 36 3
2 C 25 0.2 235 87 5
3 D 13 0.8 567 42 6
4 E 5 0.9 356 12 2
As you can see above, the column labels are 0, 1, 0, 1, 0, 1, etc. I want to renumber them from scratch as 0, 1, 2, 3, 4, ..., so I tried the following:
dat = dat.reset_index(drop=True)
The column labels were not changed. How do I get them renumbered in this case? Thanks in advance.
reset_index only resets the row index; the duplicated 0/1 labels here are column labels, so assign new ones directly:
dat.columns = range(dat.shape[1])
There are quite a few ways:
dat = dat.rename(columns = lambda x: dat.columns.get_loc(x))
Or
dat = dat.rename(columns = dict(zip(dat.columns, range(dat.shape[1]))))
Or
dat = dat.set_axis(pd.RangeIndex(dat.shape[1]), axis=1, inplace=False)
Out[677]:
0 1 2 3 4 5
0 A 23 0.10 122 56 9
1 B 24 0.45 564 36 3
2 C 25 0.20 235 87 5
3 D 13 0.80 567 42 6
4 E 5 0.90 356 12 2
I have a DataFrame as below (the result of a pivot_table() call).
Location Loc 2 Loc 3 Loc 5 Loc 8 Loc 9
Item
1 404 317 272 113 449
3 1,205 870 846 371 1,632
5 208 218 128 31 268
7 107 54 57 17 179
9 387 564 245 83 571
10 364 280 115 34 252
16 104 80 72 22 143
17 111 85 44 10 209
18 124 182 67 27 256
19 380 465 219 103 596
If you take a closer look at it, there are missing Locations (e.g. Loc 1, Loc 4, etc.) and missing Items (e.g. 2, 4, 8, etc.).
I want to export this to my pre-defined Excel template, which has all the Locations & Items, and fill the table based on Items & Values.
I know I can export the dataframe to a different Excel sheet & use SUMIFS() or INDEX()/MATCH() formulas, but I want to do this directly from Python/pandas to Excel.
Below should be the result after exporting
Loc 1 Loc 2 Loc 3 Loc 4 Loc 5 Loc 6 Loc 7 Loc 8 Loc 9
1 0 404 317 0 272 0 0 113 449
2 0 0 0 0 0 0 0 0 0
3 0 1205 870 0 846 0 0 371 1632
4 0 0 0 0 0 0 0 0 0
5 0 208 218 0 128 0 0 31 268
6 0 0 0 0 0 0 0 0 0
7 0 107 54 0 57 0 0 17 179
8 0 0 0 0 0 0 0 0 0
9 0 387 564 0 245 0 0 83 571
10 0 364 280 0 115 0 0 34 252
11 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0
16 0 104 80 0 72 0 0 22 143
17 0 111 85 0 44 0 0 10 209
18 0 124 182 0 67 0 0 27 256
19 0 380 465 0 219 0 0 103 596
20 0 0 0 0 0 0 0 0 0
Use DataFrame.reindex with the new index and column values in arrays or lists:
import numpy as np

idx = np.arange(1, 21)
cols = [f'Loc {x}' for x in np.arange(1, 10)]
df = df.reindex(index=idx, columns=cols, fill_value=0)
print(df)
Loc 1 Loc 2 Loc 3 Loc 4 Loc 5 Loc 6 Loc 7 Loc 8 Loc 9
1 0 404 317 0 272 0 0 113 449
2 0 0 0 0 0 0 0 0 0
3 0 1,205 870 0 846 0 0 371 1,632
4 0 0 0 0 0 0 0 0 0
5 0 208 218 0 128 0 0 31 268
6 0 0 0 0 0 0 0 0 0
7 0 107 54 0 57 0 0 17 179
8 0 0 0 0 0 0 0 0 0
9 0 387 564 0 245 0 0 83 571
10 0 364 280 0 115 0 0 34 252
11 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0
16 0 104 80 0 72 0 0 22 143
17 0 111 85 0 44 0 0 10 209
18 0 124 182 0 67 0 0 27 256
19 0 380 465 0 219 0 0 103 596
20 0 0 0 0 0 0 0 0 0
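To push the reindexed frame into an existing workbook rather than a new file, one option is pandas' ExcelWriter in append mode; a minimal sketch, where the file name, sheet name, and start cell are placeholders rather than details from the question:
import pandas as pd

# requires openpyxl; 'overlay' writes into the existing sheet instead of replacing it
with pd.ExcelWriter('template.xlsx', engine='openpyxl', mode='a',
                    if_sheet_exists='overlay') as writer:
    df.to_excel(writer, sheet_name='Report', startrow=1, startcol=1,
                header=False, index=False)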
For the code:
dataset = pd.read_csv("/Users/Akshita/Desktop/EE660/donor_raw_data_medmean.csv", header=None, names=None)
# Separate data and label
X_label = dataset[1:19373][0]
X_data = dataset[1:19373]
print(X_data[X_label==1])
I get the output (there are actually ~4000 samples with label=1):
0 1 2 3 4 5 6 7 8 9 ... 51 52 53 54 55 56 57 58 \
16386 1 17 60 0 1 0 0 0 0 1 ... 0 20 20 20 5 10 15 15
16396 1 137 60 0 1 0 0 0 0 1 ... 15 25 10 15 6 14 16 120
16399 1 89 54 0 1 0 0 0 0 1 ... 10 15 5 15 6 14 16 79
16402 1 89 75 0 1 0 0 0 0 1 ... 25 35 10 35 6 13 15 79
..
..
19356 1 101 80 1 0 0 1 0 0 2 ... 25 30 5 28 7 16 18 101
19363 1 65 70 1 0 0 1 0 0 1 ... 7 12 5 10 4 8 20 63
19372 1 29 70 0 0 0 1 0 0 2 ... 0 25 25 25 4 9 24 24
..
[859 rows x 61 columns]
and for
print(X_data[X_label==0])
I get the output (there are about ~15000 samples with label=0):
0 1 2 3 4 5 6 7 8 9 ... 51 52 53 54 55 56 57 58 \
16384 0 17 74 0 1 0 0 0 0 1 ... 0 15 15 15 4 10 17 17
16385 0 17 60 0 1 0 0 0 0 2 ... 0 15 15 15 4 11 17 17
16387 0 29 67 0 1 0 0 0 0 1 ... 0 20 20 20 5 11 23 28
16388 0 53 60 0 1 0 0 0 0 1 ... 5 30 25 30 5 11 26 52
16389 0 65 49 0 1 0 0 0 0 1 ... 30 35 5 27 6 13 16 56
..
..
19369 0 137 77 1 0 1 0 0 0 1 ... 9 10 1 10 6 13 21 130
19370 0 29 60 1 0 0 1 0 0 1 ... 0 15 15 15 3 9 23 23
19371 0 129 78 1 0 0 1 0 0 2 ... 20 25 5 25 7 24 8 129
What am I doing wrong?