How to do similar type of columns addition in Pyspark? - apache-spark

I want to do addition of similar type of columns (total columns are more than 100) as follows:
id
b
c
d
b_apac
c_apac
d_apac
abcd
3
5
null
45
9
1
bcd
13
15
1
45
2
10
cd
32
null
6
45
90
1
resultant table should look like this:
id
b_sum
c_sum
d_sum
abcd
48
14
1
bcd
58
17
11
cd
77
90
7
Please help me with some generic code as I have more than 100 columns to do this for. |

You can use use sum and check the prefix of your column name:
df.select(
'id',
sum([df[col] for col in df.columns if col.startswith('b')]).alias('b_sum'),
sum([df[col] for col in df.columns if col.startswith('c')]).alias('c_sum'),
sum([df[col] for col in df.columns if col.startswith('d')]).alias('d_sum'),
).show(10, False)

Related

loops application in dataframe to find output

I have the following data:
dict={'A':[1,2,3,4,5],'B':[10,20,233,29,2],'C':[10,20,3040,230,238]...................}
and
df= pd.Dataframe(dict)
In this manner I have 20 columns with 5 numerical entry in each column
I want to have a new column where the value should come as the following logic:
0 A[0]*B[0]+A[0]*C[0] + A[0]*D[0].......
1 A[1]*B[1]+A[1]*C[1] + A[1]*D[1].......
2 A[2]*B[2]+A[2]*B[2] + A[2]*D[2].......
I tried in the following manner but manually I can not put 20 columns, so I wanted to know the way to apply a loop to get the desired output
:
lst=[]
for i in range(0,5):
j=df.A[i]*df.B[i]+ df.A[i]*df.C[i]+.......
lst.append(j)
i=i+1
A potential solution is the following. I am only taking the example you posted but is works fine for more. Your data is df
A B C
0 1 10 10
1 2 20 20
2 3 233 3040
3 4 29 230
4 5 2 238
You can create a new column, D by first subsetting your dataframe
add = df.loc[:, df.columns != 'A']
and then take the sum over all multiplications of the columns in D with column A in the following way:
df['D'] = df['A']*add.sum(axis=1)
which returns
A B C D
0 1 10 10 20
1 2 20 20 80
2 3 233 3040 9819
3 4 29 230 1036
4 5 2 238 1200

Unstack a dataframe with duplicated index in Pandas

Given a toy dataset as follow which has duplicated price and quantity:
city item value
0 bj price 12
1 bj quantity 15
2 bj price 12
3 bj quantity 15
4 bj level a
5 sh price 45
6 sh quantity 13
7 sh price 56
8 sh quantity 7
9 sh level b
I want to reshape it into the following dataframe, which means add sell_ for the first pair and buy_ for the second pair:
city sell_price sell_quantity buy_price buy_quantity level
0 bj 12 15 13 16 a
1 sh 45 13 56 7 b
I have tried with df.set_index(['city', 'item']).unstack().reset_index(), but it raises an error: ValueError: Index contains duplicate entries, cannot reshape.
How could I get the desired output as above? Thanks.
You can add for second duplicated values buy_ and for first duplicates sell_ and change values in item before your solution:
m1 = df.duplicated(['city', 'item'])
m2 = df.duplicated(['city', 'item'], keep=False)
df['item'] = np.where(m1, 'buy_', np.where(m2, 'sell_', '')) + df['item']
df = (df.set_index(['city', 'item'])['value']
.unstack()
.reset_index()
.rename_axis(None, axis=1))
#for change order of columns names
df = df[['city','sell_price','sell_quantity','buy_price','buy_quantity','level']]
print (df)
city sell_price sell_quantity buy_price buy_quantity level
0 bj 12 15 12 15 a
1 sh 45 13 56 7 b

Creating an aggregate columns in pandas dataframe

I have a pandas dataframe as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ORDER':["A", "A", "B", "B"], 'var1':[2, 3, 1, 5],'a1_bal':[1,2,3,4], 'a1c_bal':[10,22,36,41], 'b1_bal':[1,2,33,4], 'b1c_bal':[11,22,3,4], 'm1_bal':[15,2,35,4]})
df
ORDER var1 a1_bal a1c_bal b1_bal b1c_bal m1_bal
0 A 2 1 10 1 11 15
1 A 3 2 22 2 22 2
2 B 1 3 36 33 3 35
3 B 5 4 41 4 4 4
I want to create new columns as below:
a1_final_bal = sum(a1_bal, a1c_bal)
b1_final_bal = sum(b1_bal, b1c_bal)
m1_final_bal = m1_bal (since we only have m1_bal field not m1c_bal, so it will renain as it is)
I don't want to hardcode this step because there might be more such columns as "c_bal", "m2_bal", "m2c_bal" etc..
My final data should look something like below
ORDER var1 a1_bal a1c_bal b1_bal b1c_bal m1_bal a1_final_bal b1_final_bal m1_final_bal
0 A 2 1 10 1 11 15 11 12 15
1 A 3 2 22 2 22 2 24 24 2
2 B 1 3 36 33 3 35 38 36 35
3 B 5 4 41 4 4 4 45 8 4
You could try something like this. I am not sure if its exactly what you are looking for, but I think it should work.
dfforgroup = df.set_index(['ORDER','var1']) #Creates MultiIndex
dfforgroup.columns = dfforgroup.columns.str[:2] #Takes first two letters of remaining columns
df2 = dfforgroup.groupby(dfforgroup.columns,axis=1).sum().reset_index().drop(columns =
['ORDER','var1']).add_suffix('_final_bal') #groups columns by their first two letters and sums the columns up
df = pd.concat([df,df2],axis=1) #concatenates new columns to original df

Grouping data based on month-year in pandas and then dropping all entries except the latest one- Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write a code that groups data by particular month-year and then keep the entry of latest date in that particular month-year and drop the rest. The data is till year 2020
I was only able to fetch the count by month-year. I am not able to drop create a proper code that helps to group data as per month-year and indicator and get the correct results
Use Series.dt.to_period for months periods, aggregate index of maximal date per groups by DataFrameGroupBy.idxmax and then pass to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('m'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('m'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90

How to take values in the column as the columns in the DataFrame in pandas

My current DataFrame is:
Term value
Name
A 1 35
A 2 40
A 3 50
B 1 20
B 2 45
B 3 50
I want to get a dataframe as:
Term 1 2 3
Name
A 35 40 50
B 20 45 50
How can i get it?I've tried using pivot_table but i didn't get my expected output.Is there any way to get my expected output?
Use:
df = df.set_index('Term', append=True)['value'].unstack()
Or:
df = pd.pivot(df.index, df['Term'], df['value'])
print (df)
Term 1 2 3
Name
A 35 40 50
B 20 45 50
EDIT: If duplicates in pairs Name with Term is necessary aggretion, e.g. sum or mean:
df = df.groupby(['Name','Term'])['value'].sum().unstack(fill_value=0)

Resources