I am trying to convert an Excel IF/ELSE condition into a pandas DataFrame column. Can anyone help me with this?
Input: df
Name1 Name2 Name3 Name4 Value1 Value2 Value3 Value4 MaxValue
0 John1 John2 John3 John4 10 3 5 7 10
1 Sony1 Sony2 Sony3 Sony4 2 12 4 8 12
2 Mark1 Mark2 Mark3 Mark4 5 13 0 3 13
3 Biky1 Biky2 Biky3 Biky4 7 7 5 44 44
4 Rose1 Rose2 Rose3 Rose4 7 0 9 7 9
The Name values may not end with 1/2/3, etc.; they may be entirely different names as well.
Output: how can I calculate the Final_Name column?
Name1 Name2 Name3 Name4 Value1 Value2 Value3 Value4 MaxValue Final_Name
0 John1 John2 John3 John4 10 3 5 7 10 John1
1 Sony1 Sony2 Sony3 Sony4 2 12 4 8 12 Sony2
2 Mark1 Mark2 Mark3 Mark4 5 13 0 3 13 Mark2
3 Biky1 Biky2 Biky3 Biky4 7 7 5 44 44 Biky4
4 Rose1 Rose2 Rose3 Rose4 7 0 9 7 9 Rose3
In Excel, we can write something like this:
=IF(I2=H2,D2,IF(I2=G2,C2,IF(I2=F2,B2,IF(I2=E2,A2,""))))
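For comparison, that nested IF translates almost mechanically to pandas with numpy.select, which evaluates a ladder of conditions in order (a sketch that also rebuilds the sample frame so it runs standalone):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name1': ['John1', 'Sony1', 'Mark1', 'Biky1', 'Rose1'],
    'Name2': ['John2', 'Sony2', 'Mark2', 'Biky2', 'Rose2'],
    'Name3': ['John3', 'Sony3', 'Mark3', 'Biky3', 'Rose3'],
    'Name4': ['John4', 'Sony4', 'Mark4', 'Biky4', 'Rose4'],
    'Value1': [10, 2, 5, 7, 7],
    'Value2': [3, 12, 13, 7, 0],
    'Value3': [5, 4, 0, 5, 9],
    'Value4': [7, 8, 3, 44, 7],
})
df['MaxValue'] = df.filter(regex='^Value').max(axis=1)

# same ladder as the Excel formula: the first matching condition wins
df['Final_Name'] = np.select(
    [df['MaxValue'].eq(df['Value4']),
     df['MaxValue'].eq(df['Value3']),
     df['MaxValue'].eq(df['Value2']),
     df['MaxValue'].eq(df['Value1'])],
    [df['Name4'], df['Name3'], df['Name2'], df['Name1']],
    default='')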
You can first filter df into two parts, then use the position of each row's maximum value to locate the corresponding Name:
v = df.filter(regex='^Value')    # the Value columns
name = df.filter(regex='^Name')  # the Name columns, in the same order
# row positions x the column position of each row's max Value;
# this picks the matching Name positionally (assumes a default RangeIndex)
df['out'] = name.values[df.index, v.columns.get_indexer(v.idxmax(1))]
df
Out[188]:
Name1 Name2 Name3 Name4 Value1 Value2 Value3 Value4 MaxValue out
0 John1 John2 John3 John4 10 3 5 7 10 John1
1 Sony1 Sony2 Sony3 Sony4 2 12 4 8 12 Sony2
2 Mark1 Mark2 Mark3 Mark4 5 13 0 3 13 Mark2
3 Biky1 Biky2 Biky3 Biky4 7 7 5 44 44 Biky4
4 Rose1 Rose2 Rose3 Rose4 7 0 9 7 9 Rose3
You can first create a helper column that shows which 'Name' column should be returned, using idxmax(). Then stack() your 'Name' columns and merge the result with the helper column, matching on the index and the 'Name' label:
# Create a helper column
v_c = [c for c in df if c.startswith('Value')]
df['id_col'] = df[v_c].idxmax(axis=1).str.replace('Value','Name')
# Merge the helper column with your stacked 'Name' columns
n_c = df.filter(like='Name').columns
res = pd.merge(df[n_c].stack().reset_index(),
               df[['id_col']].reset_index(),
               left_on=['level_0', 'level_1'],
               right_on=['index', 'id_col'])[0]
# Assign as a column
df['Final_Name'] = res
prints:
Name1 Name2 Name3 Name4 ... Value4 MaxValue id_col Final_Name
0 John1 John2 John3 John4 ... 7 10 Name1 John1
1 Sony1 Sony2 Sony3 Sony4 ... 8 12 Name2 Sony2
2 Mark1 Mark2 Mark3 Mark4 ... 3 13 Name2 Mark2
3 Biky1 Biky2 Biky3 Biky4 ... 44 44 Name4 Biky4
4 Rose1 Rose2 Rose3 Rose4 ... 7 9 Name3 Rose3
[5 rows x 11 columns]
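A more compact route to the same result (a sketch; it assumes the default RangeIndex of the sample) maps each row's winning Value column to its Name counterpart and picks the values by position:
import numpy as np

# 'Value2' -> 'Name2', etc., for the column holding each row's maximum
winner = df.filter(regex='^Value').idxmax(axis=1).str.replace('Value', 'Name')
# integer positions of those Name columns, then a row-wise positional pick
df['Final_Name'] = df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(winner)]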
I have a dataframe that looks like
Date col_1 col_2 col_3
2022-08-20 5 B 1
2022-07-21 6 A 1
2022-07-20 2 A 1
2022-06-15 5 B 1
2022-06-11 3 C 1
2022-06-05 5 C 2
2022-06-01 3 B 2
2022-05-21 6 A 1
2022-05-13 6 A 0
2022-05-10 2 B 3
2022-04-11 2 C 3
2022-03-16 5 A 3
2022-02-20 5 B 1
and I want to add a new column col_new that cumulatively counts the rows with the same elements in col_1 and col_2, excluding the row itself, and such that the element in col_3 is 1. So the desired output would look like:
Date col_1 col_2 col_3 col_new
2022-08-20 5 B 1 3
2022-07-21 6 A 1 2
2022-07-20 2 A 1 1
2022-06-15 5 B 1 2
2022-06-11 3 C 1 1
2022-06-05 5 C 2 0
2022-06-01 3 B 2 0
2022-05-21 6 A 1 1
2022-05-13 6 A 0 0
2022-05-10 2 B 3 0
2022-04-11 2 C 3 0
2022-03-16 5 A 3 0
2022-02-20 5 B 1 1
And here's what I have tried:
Date = pd.to_datetime(df['Date'], dayfirst=True)
list_col_3_is_1 = (df
                   .assign(Date=Date)
                   .sort_values('Date', ascending=True)
                   ['col_3'].eq(1))
df['col_new'] = (list_col_3_is_1.groupby(df[['col_1', 'col_2']])
                 .apply(lambda g: g.shift(1, fill_value=0).cumsum()))
But then I got the following error: ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
Thanks in advance.
Your solution should be changed: a DataFrame is not a valid grouper (hence the ValueError), so pass a list of Series instead. The shift and apply are also unnecessary, because the expected output counts the current row whenever its col_3 is 1:
df['col_new'] = list_col_3_is_1.groupby([df['col_1'],df['col_2']]).cumsum()
print (df)
Date col_1 col_2 col_3 col_new
0 2022-08-20 5 B 1 3
1 2022-07-21 6 A 1 2
2 2022-07-20 2 A 1 1
3 2022-06-15 5 B 1 2
4 2022-06-11 3 C 1 1
5 2022-06-05 5 C 2 0
6 2022-06-01 3 B 2 0
7 2022-05-21 6 A 1 1
8 2022-05-13 6 A 0 0
9 2022-05-10 2 B 3 0
10 2022-04-11 2 C 3 0
11 2022-03-16 5 A 3 0
12 2022-02-20 5 B 1 1
Assuming you already have the rows sorted in the desired order (newest first), you can reverse the frame so the cumulative sum runs oldest to newest; the result realigns on the original index when assigned:
df['col_new'] = (df[::-1].assign(n=df['col_3'].eq(1))
                 .groupby(['col_1', 'col_2'])['n'].cumsum())
Output:
Date col_1 col_2 col_3 col_new
0 2022-08-20 5 B 1 3
1 2022-07-21 6 A 1 2
2 2022-07-20 2 A 1 1
3 2022-06-15 5 B 1 2
4 2022-06-11 3 C 1 1
5 2022-06-05 5 C 2 0
6 2022-06-01 3 B 2 0
7 2022-05-21 6 A 1 1
8 2022-05-13 6 A 0 0
9 2022-05-10 2 B 3 0
10 2022-04-11 2 C 3 0
11 2022-03-16 5 A 3 0
12 2022-02-20 5 B 1 1
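To pin down the logic, here is a deliberately slow brute-force cross-check (a sketch; 'check' is a hypothetical helper column): for each row, count the rows with the same col_1/col_2 whose date is at or before that row's date and whose col_3 equals 1. On the sample data it reproduces col_new:
dates = pd.to_datetime(df['Date'])
df['check'] = [
    ((df['col_1'] == row.col_1)
     & (df['col_2'] == row.col_2)
     & df['col_3'].eq(1)
     & (dates <= dates.loc[i])).sum()
    for i, row in df.iterrows()
]
assert (df['check'] == df['col_new']).all()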
I have got a dataframe df:
Aisle Table_no Table_Bit
11 2 1
11 2 2
11 2 3
11 3 1
11 3 2
11 3 3
14 2 1
14 2 2
14 2 3
and another dataframe spc_df:
Aisle Table_no Item Item_time Space
11 2 Mango 2 0.25
11 2 Lemon 1 0.125
11 3 Apple 3 0.75
14 2 Orange 1 0.125
14 2 Melon 2 0.25
I need to add the columns spc_df['Item'] and spc_df['Space'] to dataframe df, repeating each value the number of times given in spc_df['Item_time'], as shown in the expected output.
Additional note (may or may not be useful for the logic): the sum of Item_time for every Aisle-Table_no combination equals the max of Table_Bit for that combination.
Expected Output:
Aisle Table_no Table_Bit Item Space
11 2 1 Mango 0.25
11 2 2 Mango 0.25
11 2 3 Lemon 0.125
11 3 1 Apple 0.75
11 3 2 Apple 0.75
11 3 3 Apple 0.75
14 2 1 Orange 0.125
14 2 2 Melon 0.25
14 2 3 Melon 0.25
First, repeat the rows of spc_df according to Item_time and add a counter column with GroupBy.cumcount; a left join with the original df then works, provided Table_Bit is a counter starting at 1 within each group:
# repeat each row Item_time times, then number the rows 1..n per group
df2 = (spc_df.loc[spc_df.index.repeat(spc_df['Item_time'])]
       .assign(Table_Bit=lambda x: x.groupby(['Aisle', 'Table_no']).cumcount().add(1)))
# with no 'on', merge uses all common columns: Aisle, Table_no, Table_Bit
df = df.merge(df2, how='left')
print (df)
Aisle Table_no Table_Bit Item Item_time Space
0 11 2 1 Mango 2 0.250
1 11 2 2 Mango 2 0.250
2 11 2 3 Lemon 1 0.125
3 11 3 1 Apple 3 0.750
4 11 3 2 Apple 3 0.750
5 11 3 3 Apple 3 0.750
6 14 2 1 Orange 1 0.125
7 14 2 2 Melon 2 0.250
8 14 2 3 Melon 2 0.250
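For reference, the intermediate df2 built from the sample spc_df above looks like this (the index labels repeat because of the repeat step, and Table_Bit restarts at 1 within each Aisle-Table_no group, which is what lines it up with df):
print (df2)
   Aisle  Table_no    Item  Item_time  Space  Table_Bit
0     11         2   Mango          2  0.250          1
0     11         2   Mango          2  0.250          2
1     11         2   Lemon          1  0.125          3
2     11         3   Apple          3  0.750          1
2     11         3   Apple          3  0.750          2
2     11         3   Apple          3  0.750          3
3     14         2  Orange          1  0.125          1
4     14         2   Melon          2  0.250          2
4     14         2   Melon          2  0.250          3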
If Table_Bit is not such a counter column, create a helper column new on both frames instead:
# the counter 'new' starts at 0 and is built on both sides
df2 = (spc_df.loc[spc_df.index.repeat(spc_df['Item_time'])]
       .assign(new=lambda x: x.groupby(['Aisle', 'Table_no']).cumcount()))
df = (df.assign(new=df.groupby(['Aisle', 'Table_no']).cumcount())
      .merge(df2, how='left')   # merges on Aisle, Table_no and new
      .drop('new', axis=1))
print (df)
Aisle Table_no Table_Bit Item Item_time Space
0 11 2 1 Mango 2 0.250
1 11 2 2 Mango 2 0.250
2 11 2 3 Lemon 1 0.125
3 11 3 1 Apple 3 0.750
4 11 3 2 Apple 3 0.750
5 11 3 3 Apple 3 0.750
6 14 2 1 Orange 1 0.125
7 14 2 2 Melon 2 0.250
8 14 2 3 Melon 2 0.250
I have a dataframe that I want to split into groups based on a condition on the flag_0 and flag_1 columns: consecutive runs of rows where flag_0 is 3 and flag_1 is 1.
Here is my dataframe example:
df = pd.DataFrame({'flag_0': [1, 2, 3, 1, 2, 3, 1, 2, 3, 3, 3, 3, 1, 2, 3, 1, 2, 3, 4, 4],
                   'flag_1': [1, 2, 3, 1, 2, 3, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 3, 4, 4],
                   'dd': [1, 1, 1, 7, 7, 7, 8, 8, 8, 1, 1, 1, 7, 7, 7, 8, 8, 8, 5, 7]})
Out[172]:
flag_0 flag_1 dd
0 1 1 1
1 2 2 1
2 3 3 1
3 1 1 7
4 2 2 7
5 3 3 7
6 1 1 8
7 2 2 8
8 3 1 8
9 3 1 1
10 3 1 1
11 3 1 1
12 1 1 7
13 2 2 7
14 3 1 7
15 1 1 8
16 2 2 8
17 3 3 8
18 4 4 5
19 4 4 7
Desired output:
group_1
Out[172]:
flag_0 flag_1 dd
9 3 1 1
10 3 1 1
11 3 1 1
group 2
Out[172]:
flag_0 flag_1 dd
14 3 1 7
You can use a mask and groupby to split the dataframe:
cond = {'flag_0': 3, 'flag_1': 1}
# True where every condition column equals its target value
mask = df[list(cond)].eq(cond).all(1)
# (~mask).cumsum() stays constant across each consecutive run of True rows,
# so it serves as a group key for the matching rows
groups = [g for k, g in df[mask].groupby((~mask).cumsum())]
output:
[ flag_0 flag_1 dd
8 3 1 8
9 3 1 1
10 3 1 1
11 3 1 1,
flag_0 flag_1 dd
14 3 1 7]
groups[0]
flag_0 flag_1 dd
8 3 1 8
9 3 1 1
10 3 1 1
11 3 1 1
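To see why (~mask).cumsum() works as a group key, print it next to the mask (shown here for rows 7-14 of the sample; the counter only advances on non-matching rows, so every consecutive run of matches shares one value):
print(pd.DataFrame({'mask': mask, 'key': (~mask).cumsum()}).iloc[7:15])
     mask  key
7   False    8
8    True    8
9    True    8
10   True    8
11   True    8
12  False    9
13  False   10
14   True   10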
I read a PDF file with PDFMiner and get a string with the following structure:
text
text
text
col1
1
2
3
4
5
col2
(1)
(2)
(3)
(7)
(4)
col3
name1
name2
name3
name4
name5
col4
name
5
45
7
87
8
col5
FAE
EFD
SDE
FEF
RGE
col6
name
45
7
54
4
130
# col7
16
18
22
17
25
col8
col9
55
30
60
1
185
col10
name
1
7
1
8
text1
text1
text1
col1
6
7
8
9
10
col2
(1)
(2)
(3)
(7)
(4)
col3
name6
name7
name8
name9
name10
col4
name
54
4
78
8
86
col5
SDE
FFF
EEF
GFE
JHG
col6
name
6
65
65
45
78
# col7
16
18
22
17
25
col8
col9
55
30
60
1
185
col10
name
1
4
1
54
I have 10 columns named: col1, col2, col3, col4 name, col5, col6 name, # col7, col8, col9, col10 name.
But since I have those 10 columns on each page, the structure repeats. The names will always be the same on each page. I am not sure how to pull everything into the same dataframe.
For example for col1 I would have in the dataframe:
1
2
3
4
5
6
7
8
9
10
I also have some empty columns (col8 in my example) and I am not sure how to deal with them.
Any ideas? Thanks!
You can use regex to parse the document (regex101), for example (txt is your string from the question):
import re
d = {}
# header: optional '# ', then 'col' plus digits, optionally followed by a 'name' line;
# body: lazy match up to a blank line, the next header, or the end of the string
pattern = r'\n^((?:#\s)?col\d+(?:\n\s*name\n+)?)(.*?)(?=\n\n|^(?:#\s)?col\d+|\Z)'
for col_name, cols in re.findall(pattern, txt, flags=re.M | re.S):
    d.setdefault(col_name.strip(), []).extend(cols.strip().split('\n'))
df = pd.DataFrame.from_dict(d, orient='index').T  # pads shorter columns with None
print(df)
Prints:
col1 col2 col3 col4\n name col5 col6\n name # col7 col8 col9 col10\nname
0 1 (1) name1 5 FAE 45 16 55 1
1 2 (2) name2 45 EFD 7 18 30 7
2 3 (3) name3 7 SDE 54 22 None 60 1
3 4 (7) name4 87 FEF 4 17 None 1 8
4 5 (4) name5 8 RGE 130 25 None 185 1
5 6 (1) name6 54 SDE 6 16 None 55 4
6 7 (2) name7 4 FFF 65 18 None 30 1
7 8 (3) name8 78 EEF 65 22 None 60 54
8 9 (7) name9 8 GFE 45 17 None 1 None
9 10 (4) name10 86 JHG 78 25 None 185 None
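Note that the captured headers keep their embedded newlines (hence 'col4\n name' above). If that is unwanted, a small cleanup pass normalizes them (a sketch):
df.columns = [' '.join(c.split()) for c in df.columns]
print(df.columns.tolist())
# ['col1', 'col2', 'col3', 'col4 name', 'col5', 'col6 name', '# col7', 'col8', 'col9', 'col10 name']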
I have a dataframe like below. I would like to sum rows 0 to 4 (every 5 rows) and put the summed value in a new column ("new column"). My real dataframe has 263 rows, and since the Row index repeats every 13 rows (0-12), the last group of each cycle has only three rows and should be the sum of just those three. How can I do this using pandas/Python? I have started to learn Python recently. Thanks for any advice in advance!
My data pattern is more complex, as I am using the index as one of my column values ('Row') and it repeats:
Row Data "new column"
0 5
1 1
2 3
3 3
4 2 14
5 4
6 8
7 1
8 2
9 1 16
10 0
11 2
12 3 5
0 3
1 1
2 2
3 3
4 2 11
5 2
6 6
7 2
8 2
9 1 13
10 1
11 0
12 1 2
...
259 50 89
260 1
261 4
262 5 10
I tried iterrows and groupby but can't make it work so far.
Use this: group by the index integer-divided by 5, take each group's sum with transform, and keep only the last row of each group; the assignment then realigns on the index, leaving NaN everywhere else:
df['new col'] = (df.groupby(df.index // 5)['Data'].transform('sum')
                 [lambda x: ~x.duplicated(keep='last')])
Output:
Data new col
0 5 NaN
1 1 NaN
2 3 NaN
3 3 NaN
4 2 14.0
5 4 NaN
6 8 NaN
7 1 NaN
8 2 NaN
9 1 16.0
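An equivalent variant (a sketch, assuming the default 0..n-1 index) keeps each block sum only on the block's last row via where, which also sidesteps the duplicated() trick in the corner case where two blocks happen to share the same sum:
block = df.index // 5
sums = df.groupby(block)['Data'].transform('sum')
is_last = df.index.to_series().groupby(block).transform('max').eq(df.index)
df['new col'] = sums.where(is_last)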
Edit to handle updated question:
g = df.groupby(df.Row).cumcount()  # which repetition of the 0-12 cycle each row belongs to
df['new col'] = df.groupby([g, df.Row // 5])['Data']\
    .transform('sum')[lambda x: ~x.duplicated(keep='last')]
Output:
Row Data new col
0 0 5 NaN
1 1 1 NaN
2 2 3 NaN
3 3 3 NaN
4 4 2 14.0
5 5 4 NaN
6 6 8 NaN
7 7 1 NaN
8 8 2 NaN
9 9 1 16.0
10 10 0 NaN
11 11 2 NaN
12 12 3 5.0
13 0 3 NaN
14 1 1 NaN
15 2 2 NaN
16 3 3 NaN
17 4 2 11.0
18 5 2 NaN
19 6 6 NaN
20 7 2 NaN
21 8 2 NaN
22 9 1 13.0
23 10 1 NaN
24 11 0 NaN
25 12 1 2.0
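The same last-row variant carries over to the composite grouping (a sketch, under the same default-index assumption):
g = df.groupby(df.Row).cumcount()   # which repetition of the cycle
key = [g, df.Row // 5]              # (cycle, block within cycle)
sums = df.groupby(key)['Data'].transform('sum')
is_last = df.index.to_series().groupby(key).transform('max').eq(df.index)
df['new col'] = sums.where(is_last)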