Grouping several dataframe columns based on another column's values - python-3.x

I have this dataframe:
refid col2 price1 factor1 price2 factor2 price3 factor3
0 1 a 200 1 180 3 150 10
1 2 b 500 1 450 3 400 10
2 3 c 700 1 620 2 550 5
And I need to get this output:
refid col2 price factor
0 1 a 200 1
1 1 b 500 1
2 1 c 700 1
3 2 a 180 3
4 2 b 450 3
5 2 c 620 2
6 3 a 150 10
7 3 b 400 10
8 3 c 550 5
Right now I'm trying to use the df.melt method, but can't get it to work. This is the code and the current result:
df2_melt = df2.melt(id_vars=["refid", "col2"],
                    value_vars=["price1", "price2", "price3",
                                "factor1", "factor2", "factor3"],
                    var_name="Price",
                    value_name="factor")
refid col2 Price factor
0 1 a price1 200
1 2 b price1 500
2 3 c price1 700
3 1 a price2 180
4 2 b price2 450
5 3 c price2 620
6 1 a price3 150
7 2 b price3 400
8 3 c price3 550
9 1 a factor1 1
10 2 b factor1 1
11 3 c factor1 1
12 1 a factor2 3
13 2 b factor2 3
14 3 c factor2 2
15 1 a factor3 10
16 2 b factor3 10
17 3 c factor3 5

Since you have a wide DataFrame with common prefixes, you can use wide_to_long:
out = pd.wide_to_long(df, stubnames=['price', 'factor'],
                      i=['refid', 'col2'], j='num').droplevel(-1).reset_index()
Output:
refid col2 price factor
0 1 a 200 1
1 1 a 180 3
2 1 a 150 10
3 2 b 500 1
4 2 b 450 3
5 2 b 400 10
6 3 c 700 1
7 3 c 620 2
8 3 c 550 5
Note that your expected output has an error where factors don't align with refids.
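If you also want the rows ordered the way your expected output suggests (all the "1" columns first, then "2", then "3"), one minimal sketch is to sort on the j level before dropping it:
out = (pd.wide_to_long(df, stubnames=['price', 'factor'],
                       i=['refid', 'col2'], j='num')
         .sort_index(level='num')   # order by the trailing digit first
         .droplevel('num')
         .reset_index())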

You can melt twice and then concat the results:
import pandas as pd
df = pd.DataFrame({'refid': [1, 2, 3], 'col2': ['a', 'b', 'c'],
                   'price1': [200, 500, 700], 'factor1': [1, 1, 1],
                   'price2': [180, 450, 620], 'factor2': [3, 3, 2],
                   'price3': [150, 400, 550], 'factor3': [10, 10, 5]})
prices = [c for c in df if c.startswith('price')]
factors = [c for c in df if c.startswith('factor')]
df1 = pd.melt(df, id_vars=["refid", "col2"], value_vars=prices, value_name='price').drop('variable', axis=1)
df2 = pd.melt(df, id_vars=["refid", "col2"], value_vars=factors, value_name='factor').drop('variable', axis=1)
df3 = pd.concat([df1, df2['factor']], axis=1).reset_index(drop=True)
print(df3)
Here is the output:
refid col2 price factor
0 1 a 200 1
1 2 b 500 1
2 3 c 700 1
3 1 a 180 3
4 2 b 450 3
5 3 c 620 2
6 1 a 150 10
7 2 b 400 10
8 3 c 550 5

One option is pivot_longer from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd
(df
 .pivot_longer(
     index=['refid', 'col2'],
     names_to='.value',
     names_pattern=r"(.+)\d",
     sort_by_appearance=True)
)
refid col2 price factor
0 1 a 200 1
1 1 a 180 3
2 1 a 150 10
3 2 b 500 1
4 2 b 450 3
5 2 b 400 10
6 3 c 700 1
7 3 c 620 2
8 3 c 550 5
The idea behind this particular reshape is that whatever the regular expression group paired with .value captures stays as the column header.
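For comparison, a rough plain-pandas sketch of the same idea (using the df defined above; the column order of the result may differ): split each header into its stub and trailing digit, turn them into a MultiIndex, and stack the digit level.
tmp = df.set_index(['refid', 'col2'])
# pair every column name with (stub, digit), e.g. 'price2' -> ('price', '2')
tmp.columns = pd.MultiIndex.from_frame(
    tmp.columns.str.extract(r'(\D+)(\d)'), names=[None, 'num'])
out = tmp.stack('num').droplevel('num').reset_index()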

Related

Merge pandas data frame based on specific conditions

I have the dfs shown below
df1:
ID Job Salary
1 A 100
2 B 200
3 B 20
4 C 150
5 A 500
6 A 600
7 A 200
8 B 150
df2:
ID Type Status Age
1 2 P 23
2 1 P 28
8 1 F 33
4 3 P 48
14 1 F 23
11 2 P 28
16 2 F 23
41 3 P 38
df3:
ID T_Type Amount
1 K 20
2 L -50
1 K 30
3 K 5
1 K 100
2 L -50
1 L -30
25 K 500
1 K 20
4 L -80
19 K 30
2 K -5
Explanation About the data
ID is the primary key of df1.
ID is the primary key of df2.
df3 does not have any primary key.
From the above, I would like to prepare the dfs below.
1. IDs which are in df1 and df2.
Expected output1:
ID Job Salary
1 A 100
2 B 200
4 C 150
8 B 150
2. IDs which are there in df1 and not in df2
output2:
ID Job Salary
3 B 20
5 A 500
6 A 600
7 A 200
3. IDs which are there in df1 and df3
output3:
ID Job Salary
1 A 100
2 B 200
3 B 20
4 C 150
4. IDs which are there in df1 and not in df3.
output4:
ID Job Salary
5 A 500
6 A 600
7 A 200
8 B 150
>>> # 1. IDs which are in df1 and df2.
>>> df1[df1['ID'].isin(df2['ID'])]
ID Job Salary
0 1 A 100
1 2 B 200
3 4 C 150
7 8 B 150
>>> # 2. IDs which are there in df1 and not in df2
>>> df1[~df1['ID'].isin(df2['ID'])]
ID Job Salary
2 3 B 20
4 5 A 500
5 6 A 600
6 7 A 200
>>> # 3. IDs which are there in df1 and df3
>>> df1[df1['ID'].isin(df3['ID'])]
ID Job Salary
0 1 A 100
1 2 B 200
2 3 B 20
3 4 C 150
>>> # 4. IDs which are there in df1 and not in df3.
>>> df1[~df1['ID'].isin(df3['ID'])]
ID Job Salary
4 5 A 500
5 6 A 600
6 7 A 200
7 8 B 150
Actually, your expected results aren't merges at all, but rather selections based on whether df1.ID is (or is not) in the ID column of the other DataFrame.
To get your expected results, run the following commands:
result_1 = df1[df1.ID.isin(df2.ID)]
result_2 = df1[~df1.ID.isin(df2.ID)]
result_3 = df1[df1.ID.isin(df3.ID)]
result_4 = df1[~df1.ID.isin(df3.ID)]
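For completeness, the same selections can also be expressed as merges (a sketch based on the df1/df2 above; the anti-join keeps the rows that indicator=True marks as left_only):
ids2 = df2[['ID']].drop_duplicates()
result_1 = df1.merge(ids2, on='ID', how='inner')
result_2 = (df1.merge(ids2, on='ID', how='left', indicator=True)
               .query('_merge == "left_only"')
               .drop(columns='_merge'))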

Reassigning multiple columns with same array

I've been breaking my head over this simple thing. I know we can assign a single value to multiple columns using .loc, but how do I assign the same array to multiple columns?
I know I can do this. Let's say we have a dataframe df in which I wish to replace some columns with the array arr:
import random
import pandas as pd
df = pd.DataFrame({'a': [random.randint(1, 25) for i in range(5)],
                   'b': [random.randint(1, 25) for i in range(5)],
                   'c': [random.randint(1, 25) for i in range(5)]})
>>df
a b c
0 14 8 5
1 10 25 9
2 14 14 8
3 10 6 7
4 4 18 2
arr = [i for i in range(5)]
# Suppose I wish to replace columns `a` and `b` with the array `arr`
df['a'], df['b'] = [arr for j in range(2)]
Desired output:
a b c
0 0 0 16
1 1 1 10
2 2 2 1
3 3 3 20
4 4 4 11
Or I could do this with a loop of assignments. But is there a more efficient way without repetition or loops?
Let's try with assign:
cols = ['a', 'b']
df.assign(**dict.fromkeys(cols, arr))
a b c
0 0 0 5
1 1 1 9
2 2 2 8
3 3 3 7
4 4 4 2
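Note that assign returns a new DataFrame rather than modifying df in place, so reassign the result if you want to keep it:
df = df.assign(**dict.fromkeys(cols, arr))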
I used a chained assignment: df.a = df.b = arr
df = pd.DataFrame({'a': [random.randint(1, 25) for i in range(5)],
                   'b': [random.randint(1, 25) for i in range(5)],
                   'c': [random.randint(1, 25) for i in range(5)]})
arr = [i for i in range(5)]
df
a b c
0 2 8 18
1 17 15 25
2 6 5 17
3 12 15 25
4 10 10 6
df.a = df.b = arr
df
a b c
0 0 0 18
1 1 1 25
2 2 2 17
3 3 3 25
4 4 4 6

Subset and Loop to create a new column [duplicate]

With the DataFrame below as an example,
In [83]:
df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
df
Out[83]:
A B values
0 1 1 10
1 1 2 15
2 2 1 20
3 2 2 25
What would be a simple way to generate a new column containing some aggregation of the data over one of the columns?
For example, if I sum values over items in A
In [84]:
df.groupby('A').sum()['values']
Out[84]:
A
1 25
2 45
Name: values
How can I get
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
In [20]: df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
In [21]: df
Out[21]:
A B values
0 1 1 10
1 1 2 15
2 2 1 20
3 2 2 25
In [22]: df['sum_values_A'] = df.groupby('A')['values'].transform(np.sum)
In [23]: df
Out[23]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
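A string aggregation name works as well, which is the more common spelling nowadays:
df['sum_values_A'] = df.groupby('A')['values'].transform('sum')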
I found a way using join:
In [101]:
aggregated = df.groupby('A').sum()['values']
aggregated.name = 'sum_values_A'
df.join(aggregated,on='A')
Out[101]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
Anyone has a simpler way to do it?
This is not as direct, but I find it very intuitive (using map to create new columns from another column), and it can be applied to many other cases:
gb = df.groupby('A').sum()['values']
def getvalue(x):
    return gb[x]
df['sum'] = df['A'].map(getvalue)
df
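As a small follow-up, map also accepts a Series directly (it looks values up by index), so the helper function isn't strictly needed:
df['sum'] = df['A'].map(gb)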
In [15]: def sum_col(df, col, new_col):
   ....:     df[new_col] = df[col].sum()
   ....:     return df
In [16]: df.groupby("A").apply(sum_col, 'values', 'sum_values_A')
Out[16]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45

In Python Pandas using cumsum with groupby and reset of cumsum when value is 0

I'm rather new to Python.
I'm trying to compute a cumulative sum for each client to track consecutive months of inactivity (flag: 1 or 0). The cumulative sum of the 1s therefore needs to be reset whenever there is a 0, and it also needs to be reset when a new client starts. See the example below, where a is the client column and b holds the dates.
After some research, I found the questions 'Cumsum reset at NaN' and 'In Python Pandas using cumsum with groupby'. I assume I need to combine the two.
Adapting the code from 'Cumsum reset at NaN' so that it resets on 0 works:
cumsum = v.cumsum().fillna(method='pad')
reset = -cumsum[v.isnull() !=0].diff().fillna(cumsum)
result = v.where(v.notnull(), reset).cumsum()
However, I don't succeed in adding a groupby; my count just keeps going...
So, a dataset would be like this:
import pandas as pd
df = pd.DataFrame({'a': [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
                   # 'b' padded to 14 entries so the column lengths match
                   'b': [1/15,2/15,3/15,4/15,5/15,6/15,7/15,
                         1/15,2/15,3/15,4/15,5/15,6/15,7/15],
                   'c': [1,0,1,0,1,1,0,1,1,0,1,1,1,1]})
This should result in a dataframe with the columns a, b, c and d, with
'd' : [1,0,1,0,1,2,0,1,2,0,1,2,3,4]
Please note that I have a very large dataset, so calculation time is really important.
Thank you for helping me
Use groupby.apply and cumsum after identifying runs of contiguous values within each group. Then use groupby.cumcount to number the rows within each run and add 1.
Multiplying by the original column acts as an AND: it zeroes out the rows where c is 0 and keeps the running count where it is 1.
df['d'] = df.groupby('a')['c'] \
            .apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
print(df['d'])
0 1
1 0
2 1
3 0
4 1
5 2
6 0
7 1
8 2
9 0
10 1
11 2
12 3
13 4
Name: d, dtype: int64
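To see what the inner grouping key does, a quick sketch using the question's df: (x != x.shift()).cumsum() labels each run of equal values, and cumcount then numbers the rows within each run.
s = df.loc[df.a == 1, 'c']
print((s != s.shift()).cumsum().tolist())   # run labels for client 1: [1, 2, 3, 4, 5, 5, 6]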
Another way would be to apply a function after expanding on the groupby object, which evaluates the function on the series from the first index of each group up to the current index.
reduce is then used to apply a function of two arguments cumulatively to the items of each window, reducing it to a single value.
from functools import reduce
df.groupby('a')['c'].expanding() \
  .apply(lambda i: reduce(lambda x, y: x + 1 if y == 1 else 0, i, 0))
a
1 0 1.0
1 0.0
2 1.0
3 0.0
4 1.0
5 2.0
6 0.0
2 7 1.0
8 2.0
9 0.0
10 1.0
11 2.0
12 3.0
13 4.0
Name: c, dtype: float64
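To write this back into the frame as the d column, one sketch (with the df above) is to drop the group level that the expanding result carries and cast back to int:
from functools import reduce
df['d'] = (df.groupby('a')['c'].expanding()
             .apply(lambda s: reduce(lambda x, y: x + 1 if y == 1 else 0, s, 0))
             .droplevel('a')
             .astype(int))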
Timings:
%%timeit
df.groupby('a')['c'].apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
100 loops, best of 3: 3.35 ms per loop
%%timeit
df.groupby('a')['c'].expanding().apply(lambda s: reduce(lambda x, y: x+1 if y==1 else 0, s, 0))
1000 loops, best of 3: 1.63 ms per loop
I think you need a custom function with groupby:
# change row with index 6 to 1 for better testing
df = pd.DataFrame({'a': [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
                   'b': [1/15,2/15,3/15,4/15,5/15,6/15,1/15,
                         2/15,3/15,4/15,5/15,6/15,7/15,8/15],
                   'c': [1,0,1,0,1,1,1,1,1,0,1,1,1,1],
                   'd': [1,0,1,0,1,2,3,1,2,0,1,2,3,4]})
print(df)
a b c d
0 1 0.066667 1 1
1 1 0.133333 0 0
2 1 0.200000 1 1
3 1 0.266667 0 0
4 1 0.333333 1 1
5 1 0.400000 1 2
6 1 0.066667 1 3
7 2 0.133333 1 1
8 2 0.200000 1 2
9 2 0.266667 0 0
10 2 0.333333 1 1
11 2 0.400000 1 2
12 2 0.466667 1 3
13 2 0.533333 1 4
def f(x):
    # .loc replaces the removed .ix indexer; mark rows where c == 1
    x.loc[x.c == 1, 'e'] = 1
    a = x.e.notnull()
    # running count of True values, reset at each NaN (i.e. wherever c == 0)
    x.e = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int)
    return x
print(df.groupby('a').apply(f))
a b c d e
0 1 0.066667 1 1 1
1 1 0.133333 0 0 0
2 1 0.200000 1 1 1
3 1 0.266667 0 0 0
4 1 0.333333 1 1 1
5 1 0.400000 1 2 2
6 1 0.066667 1 3 3
7 2 0.133333 1 1 1
8 2 0.200000 1 2 2
9 2 0.266667 0 0 0
10 2 0.333333 1 1 1
11 2 0.400000 1 2 2
12 2 0.466667 1 3 3
13 2 0.533333 1 4 4

summing up certain rows in a panda dataframe

I have a pandas dataframe with 1000 rows and 10 columns. I am looking to aggregate rows 100-1000 and replace them with just one row where the indexvalue is '>100' and the column values are the sum of rows 100-1000 of each column. Any ideas on a simple way of doing this? Thanks in advance
Say I have the below
a b c
0 1 10 100
1 2 20 100
2 3 60 100
3 5 80 100
and I want it replaced with
a b c
0 1 10 100
1 2 20 100
>1 8 140 200
You could use loc, but it shows SettingWithCopyWarning:
ind = 1
mask = df.index > ind
df1 = df[~mask]
df1.loc['>1', :] = df[mask].sum()
In [69]: df1
Out[69]:
a b c
0 1 10 100
1 2 20 100
>1 8 140 200
To set it without the warning you could do it with pd.concat. It may not be elegant due to the double transpose, but it works:
ind = 1
mask = df.index > ind
df1 = pd.concat([df[~mask].T, df[mask].sum()], axis=1).T
df1.index = df1.index.tolist()[:-1] + ['>{}'.format(ind)]
In [36]: df1
Out[36]:
a b c
0 1 10 100
1 2 20 100
>1 8 140 200
Some demonstrations:
In [37]: df.index > ind
Out[37]: array([False, False, True, True], dtype=bool)
In [38]: df[mask].sum()
Out[38]:
a 8
b 140
c 200
dtype: int64
In [40]: pd.concat([df[~mask].T, df[mask].sum()], axis=1).T
Out[40]:
a b c
0 1 10 100
1 2 20 100
0 8 140 200
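A slightly shorter variant of the same concat idea (same df, mask and ind as above): build the aggregated row as a one-row frame and append it.
tail = df[mask].sum().to_frame('>{}'.format(ind)).T
df1 = pd.concat([df[~mask], tail])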
