Groupby and if condition on a data frame in pandas - python-3.x

I have the below data frame (blank cells in month mean the value is missing):
df =
city  code  qty1  qty2  month  type
hyd      1    10    12      1     x
hyd      2    12    21            y
hyd      2    15    36            x
hyd      4    25    44      3     z
pune     1    10     1            x
pune     3    12     2      2     y
pune     1    15     3            x
pune     2    25     4            x
ban      2    10     1      1     x
ban      4    10     2            x
ban      2    12     3            x
ban      1    15     4      3     y
I want to group by (city, code) and compute both res1 and res2: res1 is the sum of qty1 over rows where month is missing, and res2 is the sum of qty2 over rows where month is present and type is x.
The result data frame is
result =
city  code  res1  res2
hyd      1   NaN    12
hyd      2    27   NaN
hyd      4   NaN   NaN
pune     1    25   NaN
pune     3   NaN   NaN
pune     2    25   NaN
ban      2    12    10
ban      4    10   NaN
ban      1   NaN   NaN
I have tried grouping and iterating over the result of groupby with the conditions, but got no result. Any help would be appreciated. Thanks.

You can group by, calculate what you need one by one (this treats a missing month as an empty string ''), then concat back:
g = df.groupby(['city', 'code'])
res1 = g.apply(lambda x: sum(x['qty1'][x['month'] == '']))
res2 = g.apply(lambda x: sum(x['qty2'][(x['month'] != '') & (x['type'] == 'x')]))
pd.concat([res1, res2], axis=1)
Out[135]:
            0   1
city code
ban  1      0   0
     2     12   0
     4     10   0
hyd  1      0  12
     2     27   0
     4      0   0
pune 1     25   0
     2     25   0
     3      0   0

IIUC
df = df.set_index(['city', 'code'])
cond1 = df.month.isnull()
df['res1'] = df[cond1].groupby(['city', 'code']).qty1.sum()
cond2 = df.month.notnull() & (df.type=='x')
df['res2'] = df[cond2].groupby(['city', 'code']).qty2.sum()
           qty1  qty2  month type  res1  res2
city code
hyd  1       10    12    1.0    x   NaN  12.0
     2       12    21    NaN    y  27.0   NaN
     2       15    36    NaN    x  27.0   NaN
     4       25    44    3.0    z   NaN   NaN
pune 1       10     1    NaN    x  25.0   NaN
     3       12     2    2.0    y   NaN   NaN
     1       15     3    NaN    x  25.0   NaN
     2       25     4    NaN    x  25.0   NaN
ban  2       10     1    1.0    x  12.0   1.0
     4       10     2    NaN    x  10.0   NaN
     2       12     3    NaN    x  12.0   1.0
     1       15     4    3.0    y   NaN   NaN
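A self-contained sketch of this NaN-based approach, using a hypothetical reconstruction of the sample frame (missing months as NaN). Note that groups matching neither condition, such as (ban, 1), drop out of the concat entirely; a reindex over all (city, code) pairs would restore them as all-NaN rows:

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's frame; NaN = missing month.
df = pd.DataFrame({
    'city':  ['hyd'] * 4 + ['pune'] * 4 + ['ban'] * 4,
    'code':  [1, 2, 2, 4, 1, 3, 1, 2, 2, 4, 2, 1],
    'qty1':  [10, 12, 15, 25, 10, 12, 15, 25, 10, 10, 12, 15],
    'qty2':  [12, 21, 36, 44, 1, 2, 3, 4, 1, 2, 3, 4],
    'month': [1, np.nan, np.nan, 3, np.nan, 2, np.nan, np.nan,
              1, np.nan, np.nan, 3],
    'type':  ['x', 'y', 'x', 'z', 'x', 'y', 'x', 'x', 'x', 'x', 'x', 'y'],
})

keys = ['city', 'code']
# res1: sum qty1 where month is missing; res2: sum qty2 where month is
# present and type is 'x'.
res1 = df[df['month'].isnull()].groupby(keys)['qty1'].sum()
res2 = df[df['month'].notnull() & (df['type'] == 'x')].groupby(keys)['qty2'].sum()
result = pd.concat([res1, res2], axis=1, keys=['res1', 'res2']).reset_index()
print(result)
```

Filtering before the groupby is what makes each sum conditional; the concat then aligns the two partial results on the group keys.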

Related

Pandas: Combine pandas columns that have the same column name

If we have the following df,
df
    A   A   B   B   B
0  10   2   0   3   3
1  20   4  19  21  36
2  30  20  24  24  12
3  40  10  39  23  46
How can I combine the content of the columns with the same names?
e.g.
      A   B
0    10   0
1    20  19
2    30  24
3    40  39
4     2   3
5     4  21
6    20  24
7    10  23
8   NaN   3
9   NaN  36
10  NaN  12
11  NaN  46
I tried groupby and merge, but neither does the job.
Any help is appreciated.
If column names are duplicated, you can use DataFrame.melt with concat:
df = pd.concat([df['A'].melt()['value'], df['B'].melt()['value']], axis=1, keys=['A','B'])
print (df)
A B
0 10.0 0
1 20.0 19
2 30.0 24
3 40.0 39
4 2.0 3
5 4.0 21
6 20.0 24
7 10.0 23
8 NaN 3
9 NaN 36
10 NaN 12
11 NaN 46
EDIT: for an arbitrary number of duplicated column names, iterate over the unique labels:
uniq = df.columns.unique()
df = pd.concat([df[c].melt()['value'] for c in uniq], axis=1, keys=uniq)
print (df)
A B
0 10.0 0
1 20.0 19
2 30.0 24
3 40.0 39
4 2.0 3
5 4.0 21
6 20.0 24
7 10.0 23
8 NaN 3
9 NaN 36
10 NaN 12
11 NaN 46
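A runnable sketch of the generalized version on a smaller hypothetical frame (two rows, same duplicated-column layout as the question); melting each duplicated label stacks its columns vertically, and the concat pads the shorter stacks with NaN:

```python
import pandas as pd

# Hypothetical small frame with duplicated column names.
df = pd.DataFrame([[10, 2, 0, 3, 3],
                   [20, 4, 19, 21, 36]],
                  columns=['A', 'A', 'B', 'B', 'B'])

# df[c] on a duplicated label returns all matching columns as a frame;
# melt() stacks them into a single 'value' column.
uniq = df.columns.unique()
out = pd.concat([df[c].melt()['value'] for c in uniq], axis=1, keys=uniq)
print(out)
```

The 'A' stack has 4 values and the 'B' stack 6, so rows 4-5 of column A come out as NaN after alignment.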

Read dataframe split by nan rows and reshape them into multiple dataframes in Python

I have an example Excel file data1.xlsx from here, which has a Sheet1 as follows:
Now I want to read it with openpyxl or pandas and convert it into new frames df1 and df2, which I will finally save as price and quantity sheets:
price sheet:
and quantity sheet
Code I have used:
import numpy as np
import pandas as pd

df = pd.read_excel('./data1.xlsx', sheet_name='Sheet1')
df_list = np.split(df, df[df.isnull().all(1)].index)
for df in df_list:
    print(df, '\n')
Out:
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
0 year 2018.0 2019.0 2020.0 sum
1 price 12.0 4.0 5.0 21
2 quantity 5.0 5.0 3.0 13
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
3 NaN NaN NaN NaN NaN
4 sh NaN NaN NaN NaN
5 year 2018.0 2019.0 2020.0 sum
6 price 5.0 6.0 7.0 18
7 quantity 7.0 5.0 4.0 16
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
8 NaN NaN NaN NaN NaN
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
9 NaN NaN NaN NaN NaN
10 gz NaN NaN NaN NaN
11 year 2018.0 2019.0 2020.0 sum
12 price 2.0 3.0 1.0 6
13 quantity 6.0 9.0 3.0 18
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
14 NaN NaN NaN NaN NaN
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
15 NaN NaN NaN NaN NaN
16 sz NaN NaN NaN NaN
17 year 2018.0 2019.0 2020.0 sum
18 price 8.0 2.0 3.0 13
19 quantity 5.0 4.0 3.0 12
How could I do that in Python? Thanks a lot.
Use:
#add header=None for default columns names
df = pd.read_excel('./data1.xlsx', sheet_name = 'Sheet1', header=None)
#convert columns by second row
df.columns = df.iloc[1].rename(None)
#create new column `city`: rows whose second column is empty are city
#headers, so keep only those labels and forward fill them downwards
df.insert(0, 'city', df.iloc[:, 0].mask(df.iloc[:, 1].notna()).ffill())
#convert floats to integers
df.columns = [int(x) if isinstance(x, float) else x for x in df.columns]
#convert column year to index
df = df.set_index('year')
print (df)
city 2018 2019 2020 sum
year
bj bj NaN NaN NaN NaN
year bj 2018.0 2019.0 2020.0 sum
price bj 12.0 4.0 5.0 21
quantity bj 5.0 5.0 3.0 13
NaN bj NaN NaN NaN NaN
sh sh NaN NaN NaN NaN
year sh 2018.0 2019.0 2020.0 sum
price sh 5.0 6.0 7.0 18
quantity sh 7.0 5.0 4.0 16
NaN sh NaN NaN NaN NaN
NaN sh NaN NaN NaN NaN
gz gz NaN NaN NaN NaN
year gz 2018.0 2019.0 2020.0 sum
price gz 2.0 3.0 1.0 6
quantity gz 6.0 9.0 3.0 18
NaN gz NaN NaN NaN NaN
NaN gz NaN NaN NaN NaN
sz sz NaN NaN NaN NaN
year sz 2018.0 2019.0 2020.0 sum
price sz 8.0 2.0 3.0 13
quantity sz 5.0 4.0 3.0 12
df1 = df.loc['price'].reset_index(drop=True)
print (df1)
city 2018 2019 2020 sum
0 bj 12.0 4.0 5.0 21
1 sh 5.0 6.0 7.0 18
2 gz 2.0 3.0 1.0 6
3 sz 8.0 2.0 3.0 13
df2 = df.loc['quantity'].reset_index(drop=True)
print (df2)
city 2018 2019 2020 sum
0 bj 5.0 5.0 3.0 13
1 sh 7.0 5.0 4.0 16
2 gz 6.0 9.0 3.0 18
3 sz 5.0 4.0 3.0 12
Last, writing the DataFrames to the existing file is possible with the mode='a' parameter:
with pd.ExcelWriter('data1.xlsx', mode='a') as writer:
    df1.to_excel(writer, sheet_name='price')
    df2.to_excel(writer, sheet_name='quantity')
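For reference, a runnable sketch of the same idea against an in-memory stand-in for the sheet (a hypothetical two-city frame with blocks separated by an all-NaN row, skipping the Excel round-trip):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_excel(..., header=None): city blocks
# separated by an all-NaN row, as in the question's sheet.
raw = pd.DataFrame([
    ['bj',       np.nan, np.nan, np.nan, np.nan],
    ['year',     2018,   2019,   2020,   'sum'],
    ['price',    12,     4,      5,      21],
    ['quantity', 5,      5,      3,      13],
    [np.nan,     np.nan, np.nan, np.nan, np.nan],
    ['sh',       np.nan, np.nan, np.nan, np.nan],
    ['year',     2018,   2019,   2020,   'sum'],
    ['price',    5,      6,      7,      18],
    ['quantity', 7,      5,      4,      16],
])

# Tag each row with its city: rows whose second column is empty are city
# headers, so keep only those labels and forward fill them downwards.
raw.insert(0, 'city', raw[0].mask(raw[1].notna()).ffill())
raw.columns = ['city', 'label', 2018, 2019, 2020, 'sum']

# Pick out the 'price' and 'quantity' rows per city.
df1 = raw[raw['label'] == 'price'].drop(columns='label').reset_index(drop=True)
df2 = raw[raw['label'] == 'quantity'].drop(columns='label').reset_index(drop=True)
print(df1)
print(df2)
```

The mask/ffill step is the key move: it turns the visual block layout into an explicit city column, after which the split into price and quantity frames is a plain boolean filter.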

Python Pandas DataFrame: adding a fixed value from a list to a column and generating a new column for each of the list's values

This may be a different requirement.
I have a data frame:
A   B   C
1   2   4
2   4   6
8  10  12
1   3   5
and a dynamic list (the length may vary):
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
I wish to add each list value to the C column and generate a new dataframe column for each sum. How can I do this?
A   B   C  C_1  C_2  ...  C_11
1   2   4    5    6         15
2   4   6    7    8         17
8  10  12   13   14         23
1   3   5    6    7         16
Thank you for your support
You can use a dict comprehension to create a simple dataframe of placeholder columns:
dynamic_vals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
df2 = pd.concat(
    [df, pd.DataFrame({f'C_{val}': [0] for val in dynamic_vals})],
    axis=1).fillna(0)
print(df2)
print(df2)
A B C C_1 C_2 C_3 C_4 C_5 C_6 C_7 C_8 C_9 C_10 C_11
0 1 2 4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2 4 6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 8 10 12 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 1 3 5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Or you could use assign, as suggested by @piRSquared:
df2 = df.assign(**dict((f'C_{i}', np.nan) for i in dynamic_vals))
A B C C_1 C_2 C_3 C_4 C_5 C_6 C_7 C_8 C_9 C_10 C_11
0 1 2 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 8 10 12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1 3 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Or a better, simpler solution suggested by @piRSquared:
df.join(pd.DataFrame(np.nan, df.index, dynamic_vals).add_prefix('C_'))
A B C C_1 C_2 C_3 C_4 C_5 C_6 C_7 C_8 C_9 C_10 C_11
0 1 2 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 8 10 12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1 3 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Edit: to actually fill the new columns with the sums (rather than placeholders), use df.join with a dictionary comprehension:
df.join(pd.DataFrame({f'C_{val}' : df['C'].values + val for val in dynamic_vals }))
A B C C_1 C_2 C_3 C_4 C_5 C_6 C_7 C_8 C_9 C_10 C_11
0 1 2 4 5 6 7 8 9 10 11 12 13 14 15
1 2 4 6 7 8 9 10 11 12 13 14 15 16 17
2 8 10 12 13 14 15 16 17 18 19 20 21 22 23
3 1 3 5 6 7 8 9 10 11 12 13 14 15 16
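An equivalent sketch using NumPy broadcasting instead of the dictionary comprehension: an (n, 1) column plus an (m,) list broadcasts to an (n, m) block of sums in one vectorized step (column names C_1..C_11 as in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 8, 1],
                   'B': [2, 4, 10, 3],
                   'C': [4, 6, 12, 5]})
dynamic_vals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# df[['C']].values has shape (4, 1); adding the (11,) list broadcasts
# to a (4, 11) array of C + val sums, one column per list value.
new = pd.DataFrame(df[['C']].values + np.array(dynamic_vals),
                   index=df.index,
                   columns=[f'C_{v}' for v in dynamic_vals])
out = df.join(new)
print(out)
```

This builds all the new columns in a single array operation, which scales better than computing each column separately when the list is long.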

Python: Summing every five rows of column b data and create a new column

I have a dataframe like the one below. I would like to sum rows 0 to 4 (every 5 rows) and create another column with the summed value ("new column"). My real dataframe has 263 rows, and the final group of each repeating block has only three rows (10 to 12), so its sum covers three rows only. How can I do this using Pandas/Python? I have started to learn Python recently. Thanks for any advice in advance!
My data pattern is more complex, as I am using the index as one of my column values and it repeats like:
Row Data "new column"
0 5
1 1
2 3
3 3
4 2 14
5 4
6 8
7 1
8 2
9 1 16
10 0
11 2
12 3 5
0 3
1 1
2 2
3 3
4 2 11
5 2
6 6
7 2
8 2
9 1 13
10 1
11 0
12 1 2
...
259 50 89
260 1
261 4
262 5 10
I tried iterrows and groupby but couldn't make it work so far.
Use this:
df['new col'] = (df.groupby(df.index // 5)['Data']
                   .transform('sum')[lambda x: ~x.duplicated(keep='last')])
Output:
Data new col
0 5 NaN
1 1 NaN
2 3 NaN
3 3 NaN
4 2 14.0
5 4 NaN
6 8 NaN
7 1 NaN
8 2 NaN
9 1 16.0
Edit to handle updated question:
g = df.groupby(df.Row).cumcount()
df['new col'] = df.groupby([g, df.Row // 5])['Data']\
.transform('sum')[lambda x: ~(x.duplicated(keep='last'))]
Output:
Row Data new col
0 0 5 NaN
1 1 1 NaN
2 2 3 NaN
3 3 3 NaN
4 4 2 14.0
5 5 4 NaN
6 6 8 NaN
7 7 1 NaN
8 8 2 NaN
9 9 1 16.0
10 10 0 NaN
11 11 2 NaN
12 12 3 5.0
13 0 3 NaN
14 1 1 NaN
15 2 2 NaN
16 3 3 NaN
17 4 2 11.0
18 5 2 NaN
19 6 6 NaN
20 7 2 NaN
21 8 2 NaN
22 9 1 13.0
23 10 1 NaN
24 11 0 NaN
25 12 1 2.0
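For the simple case, a variant sketch that marks each block's last row positionally instead of relying on the duplicated() trick (which can misfire when two blocks happen to share the same sum); it assumes a default RangeIndex:

```python
import pandas as pd

# Sample data: two blocks of five rows, as in the question.
df = pd.DataFrame({'Data': [5, 1, 3, 3, 2, 4, 8, 1, 2, 1]})

# Sum each consecutive block of 5 rows, then keep the total only on the
# block's last row (or the frame's final row for a short trailing block).
last = (df.index % 5 == 4) | (df.index == df.index[-1])
df['new col'] = df.groupby(df.index // 5)['Data'].transform('sum').where(last)
print(df)
```

`df.index // 5` labels rows 0-4 as group 0, rows 5-9 as group 1, and so on; `where(last)` blanks every row except the one that should carry the total.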

Calculating rolling sum in a pandas dataframe on the basis of 2 variable constraints

I want to create a variable, SumOfPrevious5OccurencesAtIDLevel, which is the sum of the previous 5 values (ordered by the Date variable) of Var1 at the ID level (column 1); otherwise it takes the value NA.
Sample Data and Output:
ID Date Var1 SumOfPrevious5OccurencesAtIDLevel
1 1/1/2018 0 NA
1 1/2/2018 1 NA
1 1/3/2018 2 NA
1 1/4/2018 3 NA
2 1/1/2018 4 NA
2 1/2/2018 5 NA
2 1/3/2018 6 NA
2 1/4/2018 7 NA
2 1/5/2018 8 NA
2 1/6/2018 9 30
2 1/7/2018 10 35
2 1/8/2018 11 40
Use groupby with transform and the rolling and shift functions:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
#if not sorted ID with datetimes
df = df.sort_values(['ID','Date'])
df['new'] = df.groupby('ID')['Var1'].transform(lambda x: x.rolling(5).sum().shift())
print (df)
ID Date Var1 SumOfPrevious5OccurencesAtIDLevel new
0 1 2018-01-01 0 NaN NaN
1 1 2018-01-02 1 NaN NaN
2 1 2018-01-03 2 NaN NaN
3 1 2018-01-04 3 NaN NaN
4 2 2018-01-01 4 NaN NaN
5 2 2018-01-02 5 NaN NaN
6 2 2018-01-03 6 NaN NaN
7 2 2018-01-04 7 NaN NaN
8 2 2018-01-05 8 NaN NaN
9 2 2018-01-06 9 30.0 30.0
10 2 2018-01-07 10 35.0 35.0
11 2 2018-01-08 11 40.0 40.0
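A minimal runnable reconstruction of this rolling-plus-shift pattern (dates omitted since only the sort order matters once the frame is sorted): `rolling(5).sum()` includes the current row, so `shift()` pushes the window one row back to make it a strictly previous-5 sum.

```python
import pandas as pd

# Minimal reconstruction of the sample data, already sorted by ID and Date.
df = pd.DataFrame({
    'ID':   [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    'Var1': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
})

# Per-ID rolling sum of the previous 5 rows, excluding the current row.
df['prev5'] = df.groupby('ID')['Var1'].transform(
    lambda x: x.rolling(5).sum().shift())
print(df)
```

The groupby keeps each ID's window separate, so ID 1 (only four rows) never fills a 5-row window and stays all-NaN, as in the expected output.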
