How to exclude rows by identifying values from two columns? - python-3.x

I have two data-frames:
df1:
X1 X2 X3
1 2 100
2 3 90
1 3 100
3 1 110
2 1 20
1 3 30
2 3 40
3 1 50
df2:
X1 X2 X3 Y
1 2 100 1
2 3 90 1
1 3 100 1
3 1 110 0
2 1 20 0
1 3 30 0
2 3 40 1
3 1 50 0
I want to exclude rows from df1, for those who have the value 1 in Y column in df2.
The identifier is (X1, X2).
Expected result:
X1 X2 X3
3 1 110
2 1 20
3 1 50

Create DataFrame by filtering by Y column and remove duplicates for avoid duplicated rows in output:
df3 = df2.loc[df2['Y'].eq(1), ['X1','X2']].drop_duplicates()
print (df3)
X1 X2
0 1 2
1 2 3
2 1 3
Then use left join with indicator=True parameter in DataFrame.merge and filter left_only rows:
df = df1.merge(df3, indicator=True, how='left').query('_merge =="left_only"').drop('_merge',1)
print (df)
X1 X2 X3
3 3 1 110
4 2 1 20
7 3 1 50

Related

Grouping several dataframe columns based on another columns values

I have this dataframe:
refid col2 price1 factor1 price2 factor2 price3 factor3
0 1 a 200 1 180 3 150 10
1 2 b 500 1 450 3 400 10
2 3 c 700 1 620 2 550 5
And I need to get this output:
refid col2 price factor
0 1 a 200 1
1 1 b 500 1
2 1 c 700 1
3 2 a 180 3
4 2 b 450 3
5 2 c 620 2
6 3 a 150 10
7 3 b 400 10
8 3 c 550 5
Right now I'm trying to use df.melt method, but can't get it to work, this is the code and the current result:
df2_melt = df2.melt(id_vars=["refid","col2"],
value_vars=["price1","price2","price3",
"factor1","factor2","factor3"],
var_name="Price",
value_name="factor")
refid col2 price factor
0 1 a price1 200
1 2 b price1 500
2 3 c price1 700
3 1 a price2 180
4 2 b price2 450
5 3 c price2 620
6 1 a price3 150
7 2 b price3 400
8 3 c price3 550
9 1 a factor1 1
10 2 b factor1 1
11 3 c factor1 1
12 1 a factor2 3
13 2 b factor2 3
14 3 c factor2 2
15 1 a factor3 10
16 2 b factor3 10
17 3 c factor3 5
Since you have a wide DataFrame with common prefixes, you can use wide_to_long:
out = pd.wide_to_long(df, stubnames=['price','factor'],
i=["refid","col2"], j='num').droplevel(-1).reset_index()
Output:
refid col2 price factor
0 1 a 200 1
1 1 a 180 3
2 1 a 150 10
3 2 b 500 1
4 2 b 450 3
5 2 b 400 10
6 3 c 700 1
7 3 c 620 2
8 3 c 550 5
Note that your expected output has an error where factors don't align with refids.
You can melt two times and then concat them:
import pandas as pd
df = pd.DataFrame({'refid': [1, 2, 3], 'col2': ['a', 'b', 'c'],
'price1': [200, 500, 700], 'factor1': [1, 1, 1],
'price2': [180, 450, 620], 'factor2': [3,3,2],
'price3': [150, 400, 550], 'factor3': [10, 10, 5]})
prices = [c for c in df if c.startswith('price')]
factors = [c for c in df if c.startswith('factor')]
df1 = pd.melt(df, id_vars=["refid","col2"], value_vars=prices, value_name='price').drop('variable', axis=1)
df2 = pd.melt(df, id_vars=["refid","col2"], value_vars=factors, value_name='factor').drop('variable', axis=1)
df3 = pd.concat([df1, df2['factor']],axis=1).reset_index().drop('index', axis=1)
print(df3)
Here is the output:
refid col2 price factor
0 1 a 200 1
1 2 b 500 1
2 3 c 700 1
3 1 a 180 3
4 2 b 450 3
5 3 c 620 2
6 1 a 150 10
7 2 b 400 10
8 3 c 550 5
One option is pivot_longer from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd
(df
.pivot_longer(
index = ['refid', 'col2'],
names_to = '.value',
names_pattern = r"(.+)\d",
sort_by_appearance = True)
)
refid col2 price factor
0 1 a 200 1
1 1 a 180 3
2 1 a 150 10
3 2 b 500 1
4 2 b 450 3
5 2 b 400 10
6 3 c 700 1
7 3 c 620 2
8 3 c 550 5
The idea for this particular reshape is that whatever group in the regular expression is paired with the .value stays as the column header.

Calculate column by condition

I have table in df:
X1 X2
1 1
1 2
2 2
2 2
3 3
3 3
And i want calculate Y, where Y = Yprevious + 1 if X1=X1previous and X2=X2previous, elso 0. Y on first line = 0. Example.
X1 X2 Y
1 1 0
2 2 0
2 2 1
2 2 2
2 2 3
3 3 0
Not a duplicate... Previously, the question was simpler - addition with a value in a specific line. Now the term appears in the calculation process. I need some cumulative calculation
What I need, more example:
X1 X2 Y
1 1 0
2 2 0
2 2 1
2 2 2
2 2 3
3 3 0
3 3 1
2 2 0
What I get on the link to the duplicate
X1 X2 Y
1 1 0
2 2 0
2 2 1
2 2 2
2 2 3
3 3 0
3 3 1
2 2 4
Use GroupBy.cumcount with new columns by consecutive values:
df1 = df[['X1','X2']].ne(df[['X1','X2']].shift()).cumsum()
df['Y'] = df.groupby([df1['X1'], df1['X2']]).cumcount()
print (df)
X1 X2 Y
0 1 1 0
1 2 2 0
2 2 2 1
3 2 2 2
4 2 2 3
5 3 3 0
6 3 3 1
7 2 2 0

Subset and Loop to create a new column [duplicate]

With the DataFrame below as an example,
In [83]:
df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
df
Out[83]:
A B values
0 1 1 10
1 1 2 15
2 2 1 20
3 2 2 25
What would be a simple way to generate a new column containing some aggregation of the data over one of the columns?
For example, if I sum values over items in A
In [84]:
df.groupby('A').sum()['values']
Out[84]:
A
1 25
2 45
Name: values
How can I get
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
In [20]: df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
In [21]: df
Out[21]:
A B values
0 1 1 10
1 1 2 15
2 2 1 20
3 2 2 25
In [22]: df['sum_values_A'] = df.groupby('A')['values'].transform(np.sum)
In [23]: df
Out[23]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
I found a way using join:
In [101]:
aggregated = df.groupby('A').sum()['values']
aggregated.name = 'sum_values_A'
df.join(aggregated,on='A')
Out[101]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
Anyone has a simpler way to do it?
This is not so direct but I found it very intuitive (the use of map to create new columns from another column) and can be applied to many other cases:
gb = df.groupby('A').sum()['values']
def getvalue(x):
return gb[x]
df['sum'] = df['A'].map(getvalue)
df
In [15]: def sum_col(df, col, new_col):
....: df[new_col] = df[col].sum()
....: return df
In [16]: df.groupby("A").apply(sum_col, 'values', 'sum_values_A')
Out[16]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45

How to iterate through 'nested' dataframes without 'for' loops in pandas (python)?

I'm trying to check the cartesian distance between each set of points in one dataframe to sets of scattered points in another dataframe, to see if the input gets above a threshold 'distance' of my checking points.
I have this working with nested for loops, but is painfully slow (~7 mins for 40k input rows, each checked vs ~180 other rows, + some overhead operations).
Here is what I'm attempting in vectorialized format - 'for every pair of points (a,b) from df1, if the distance to ANY point (d,e) from df2 is > threshold, print "yes" into df1.c, next to input points.
..but I'm getting unexpected behavior from this. With given data, all but one distances are > 1, but only df1.1c is getting 'yes'.
Thanks for any ideas - the problem is probably in the 'df1.loc...' line:
import numpy as np
from pandas import DataFrame
inp1 = [{'a':1, 'b':2, 'c':0}, {'a':1,'b':3,'c':0}, {'a':0,'b':3,'c':0}]
df1 = DataFrame(inp1)
inp2 = [{'d':2, 'e':0}, {'d':0,'e':3}, {'d':0,'e':4}]
df2 = DataFrame(inp2)
threshold = 1
df1.loc[np.sqrt((df1.a - df2.d) ** 2 + (df1.b - df2.e) ** 2) > threshold, 'c'] = "yes"
print(df1)
print(df2)
a b c
0 1 2 yes
1 1 3 0
2 0 3 0
d e
0 2 0
1 0 3
2 0 4
Here is an idea to help you to start...
Source DFs:
In [170]: df1
Out[170]:
c x y
0 0 1 2
1 0 1 3
2 0 0 3
In [171]: df2
Out[171]:
x y
0 2 0
1 0 3
2 0 4
Helper DF with cartesian product:
In [172]: x = df1[['x','y']] \
.reset_index() \
.assign(k=0).merge(df2.assign(k=0).reset_index(),
on='k', suffixes=['1','2']) \
.drop('k',1)
In [173]: x
Out[173]:
index1 x1 y1 index2 x2 y2
0 0 1 2 0 2 0
1 0 1 2 1 0 3
2 0 1 2 2 0 4
3 1 1 3 0 2 0
4 1 1 3 1 0 3
5 1 1 3 2 0 4
6 2 0 3 0 2 0
7 2 0 3 1 0 3
8 2 0 3 2 0 4
now we can calculate the distance:
In [169]: x.eval("D=sqrt((x1 - x2)**2 + (y1 - y2)**2)", inplace=False)
Out[169]:
index1 x1 y1 index2 x2 y2 D
0 0 1 2 0 2 0 2.236068
1 0 1 2 1 0 3 1.414214
2 0 1 2 2 0 4 2.236068
3 1 1 3 0 2 0 3.162278
4 1 1 3 1 0 3 1.000000
5 1 1 3 2 0 4 1.414214
6 2 0 3 0 2 0 3.605551
7 2 0 3 1 0 3 0.000000
8 2 0 3 2 0 4 1.000000
or filter:
In [175]: x.query("sqrt((x1 - x2)**2 + (y1 - y2)**2) > #threshold")
Out[175]:
index1 x1 y1 index2 x2 y2
0 0 1 2 0 2 0
1 0 1 2 1 0 3
2 0 1 2 2 0 4
3 1 1 3 0 2 0
5 1 1 3 2 0 4
6 2 0 3 0 2 0
Try using scipy implementation, it is surprisingly fast
scipy.spatial.distance.pdist
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html
or
scipy.spatial.distance_matrix
https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.spatial.distance_matrix.html

In Python Pandas using cumsum with groupby and reset of cumsum when value is 0

I'm rather new at python.
I try to have a cumulative sum for each client to see the consequential months of inactivity (flag: 1 or 0). The cumulative sum of the 1's need therefore to be reset when we have a 0. The reset need to happen as well when we have a new client. See below with example where a is the column of clients and b are the dates.
After some research, I found the question 'Cumsum reset at NaN' and 'In Python Pandas using cumsum with groupby'. I assume that I kind of need to put them together.
Adapting the code of 'Cumsum reset at NaN' to the reset towards 0, is successful:
cumsum = v.cumsum().fillna(method='pad')
reset = -cumsum[v.isnull() !=0].diff().fillna(cumsum)
result = v.where(v.notnull(), reset).cumsum()
However, I don't succeed at adding a groupby. My count just goes on...
So, a dataset would be like this:
import pandas as pd
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15],
'c' : [1,0,1,0,1,1,0,1,1,0,1,1,1,1]})
this should result in a dataframe with the columns a, b, c and d with
'd' : [1,0,1,0,1,2,0,1,2,0,1,2,3,4]
Please note that I have a very large dataset, so calculation time is really important.
Thank you for helping me
Use groupby.apply and cumsum after finding contiguous values in the groups. Then groupby.cumcount to get the integer counting upto each contiguous value and add 1 later.
Multiply with the original row to create the AND logic cancelling all zeros and only considering positive values.
df['d'] = df.groupby('a')['c'] \
.apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
print(df['d'])
0 1
1 0
2 1
3 0
4 1
5 2
6 0
7 1
8 2
9 0
10 1
11 2
12 3
13 4
Name: d, dtype: int64
Another way of doing would be to apply a function after series.expanding on the groupby object which basically computes values on the series starting from the first index upto that current index.
Use reduce later to apply function of two args cumulatively to the items of iterable so as to reduce it to a single value.
from functools import reduce
df.groupby('a')['c'].expanding() \
.apply(lambda i: reduce(lambda x, y: x+1 if y==1 else 0, i, 0))
a
1 0 1.0
1 0.0
2 1.0
3 0.0
4 1.0
5 2.0
6 0.0
2 7 1.0
8 2.0
9 0.0
10 1.0
11 2.0
12 3.0
13 4.0
Name: c, dtype: float64
Timings:
%%timeit
df.groupby('a')['c'].apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
100 loops, best of 3: 3.35 ms per loop
%%timeit
df.groupby('a')['c'].expanding().apply(lambda s: reduce(lambda x, y: x+1 if y==1 else 0, s, 0))
1000 loops, best of 3: 1.63 ms per loop
I think you need custom function with groupby:
#change row with index 6 to 1 for better testing
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
'c' : [1,0,1,0,1,1,1,1,1,0,1,1,1,1],
'd' : [1,0,1,0,1,2,3,1,2,0,1,2,3,4]})
print (df)
a b c d
0 1 0.066667 1 1
1 1 0.133333 0 0
2 1 0.200000 1 1
3 1 0.266667 0 0
4 1 0.333333 1 1
5 1 0.400000 1 2
6 1 0.066667 1 3
7 2 0.133333 1 1
8 2 0.200000 1 2
9 2 0.266667 0 0
10 2 0.333333 1 1
11 2 0.400000 1 2
12 2 0.466667 1 3
13 2 0.533333 1 4
def f(x):
x.ix[x.c == 1, 'e'] = 1
a = x.e.notnull()
x.e = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0).astype(int)
return (x)
print (df.groupby('a').apply(f))
a b c d e
0 1 0.066667 1 1 1
1 1 0.133333 0 0 0
2 1 0.200000 1 1 1
3 1 0.266667 0 0 0
4 1 0.333333 1 1 1
5 1 0.400000 1 2 2
6 1 0.066667 1 3 3
7 2 0.133333 1 1 1
8 2 0.200000 1 2 2
9 2 0.266667 0 0 0
10 2 0.333333 1 1 1
11 2 0.400000 1 2 2
12 2 0.466667 1 3 3
13 2 0.533333 1 4 4

Resources