How to iterate through 'nested' dataframes without 'for' loops in pandas (python)? - python-3.x

I'm trying to check the Cartesian distance between each point in one dataframe and a set of scattered points in another dataframe, to see whether any input point gets above a threshold 'distance' from my checking points.
I have this working with nested for loops, but it is painfully slow (~7 minutes for 40k input rows, each checked against ~180 other rows, plus some overhead operations).
Here is what I'm attempting in vectorized form: for every point (a,b) from df1, if the distance to ANY point (d,e) from df2 is > threshold, write "yes" into df1.c next to the input point.
...but I'm getting unexpected behavior from this. With the given data, all but one of the distances are > 1, yet only the first row of df1.c gets 'yes'.
Thanks for any ideas - the problem is probably in the 'df1.loc...' line:
import numpy as np
from pandas import DataFrame
inp1 = [{'a':1, 'b':2, 'c':0}, {'a':1,'b':3,'c':0}, {'a':0,'b':3,'c':0}]
df1 = DataFrame(inp1)
inp2 = [{'d':2, 'e':0}, {'d':0,'e':3}, {'d':0,'e':4}]
df2 = DataFrame(inp2)
threshold = 1
df1.loc[np.sqrt((df1.a - df2.d) ** 2 + (df1.b - df2.e) ** 2) > threshold, 'c'] = "yes"
print(df1)
print(df2)
a b c
0 1 2 yes
1 1 3 0
2 0 3 0
d e
0 2 0
1 0 3
2 0 4

Here is an idea to help you to start... The root of your unexpected behavior is that df1.a - df2.d aligns the two frames by index, so your loc line computes only one distance per row instead of all pairwise distances; building a cartesian product of the two frames gives you every pair.
Source DFs:
In [170]: df1
Out[170]:
c x y
0 0 1 2
1 0 1 3
2 0 0 3
In [171]: df2
Out[171]:
x y
0 2 0
1 0 3
2 0 4
Helper DF with cartesian product:
In [172]: x = df1[['x','y']] \
              .reset_index() \
              .assign(k=0) \
              .merge(df2.assign(k=0).reset_index(), on='k', suffixes=['1','2']) \
              .drop(columns='k')
In [173]: x
Out[173]:
index1 x1 y1 index2 x2 y2
0 0 1 2 0 2 0
1 0 1 2 1 0 3
2 0 1 2 2 0 4
3 1 1 3 0 2 0
4 1 1 3 1 0 3
5 1 1 3 2 0 4
6 2 0 3 0 2 0
7 2 0 3 1 0 3
8 2 0 3 2 0 4
Now we can calculate the distance:
In [169]: x.eval("D=sqrt((x1 - x2)**2 + (y1 - y2)**2)", inplace=False)
Out[169]:
index1 x1 y1 index2 x2 y2 D
0 0 1 2 0 2 0 2.236068
1 0 1 2 1 0 3 1.414214
2 0 1 2 2 0 4 2.236068
3 1 1 3 0 2 0 3.162278
4 1 1 3 1 0 3 1.000000
5 1 1 3 2 0 4 1.414214
6 2 0 3 0 2 0 3.605551
7 2 0 3 1 0 3 0.000000
8 2 0 3 2 0 4 1.000000
or filter:
In [175]: x.query("sqrt((x1 - x2)**2 + (y1 - y2)**2) > @threshold")
Out[175]:
index1 x1 y1 index2 x2 y2
0 0 1 2 0 2 0
1 0 1 2 1 0 3
2 0 1 2 2 0 4
3 1 1 3 0 2 0
5 1 1 3 2 0 4
6 2 0 3 0 2 0
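To finish the original task — writing 'yes' into df1.c for every point whose distance to ANY df2 point exceeds the threshold — here is a minimal sketch building on the helper frame x (variable names as above):
d = x.eval("sqrt((x1 - x2)**2 + (y1 - y2)**2)")    # one distance per pair
hits = x.loc[d > threshold, 'index1'].unique()     # df1 rows with at least one far point
df1.loc[df1.index.isin(hits), 'c'] = 'yes'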

Try a SciPy implementation; it is surprisingly fast:
scipy.spatial.distance.pdist
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html
or
scipy.spatial.distance_matrix
https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.spatial.distance_matrix.html
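Note that pdist computes distances within a single point set; for two different sets, scipy.spatial.distance.cdist or scipy.spatial.distance_matrix gives the full rectangular matrix. A sketch on the question's sample data, assuming the goal is to flag df1 rows where ANY distance exceeds the threshold:
import pandas as pd
from scipy.spatial import distance_matrix

df1 = pd.DataFrame({'a': [1, 1, 0], 'b': [2, 3, 3], 'c': 0})
df2 = pd.DataFrame({'d': [2, 0, 0], 'e': [0, 3, 4]})
threshold = 1

# all pairwise distances, shape (len(df1), len(df2))
D = distance_matrix(df1[['a', 'b']].to_numpy(), df2[['d', 'e']].to_numpy())

# flag rows of df1 where ANY distance exceeds the threshold
df1.loc[(D > threshold).any(axis=1), 'c'] = 'yes'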

Related

How to sort pandas rows based on column values

in this dataframe:
Feat1 Feat2 Feat3 Feat4 Labels
-46.220314 22.862856 -6.1573067 5.6060414 2
-23.80669 20.536781 -5.015675 4.2216353 2
-42.092365 25.680704 -5.0092897 5.665794 2
-35.29639 21.709473 -4.160352 5.578346 2
-37.075096 22.347767 -3.860426 5.6953945 2
-42.8849 28.03802 -7.8572545 3.3361 2
-32.3057 26.568039 -9.47018 3.4532788 2
-24.469942 27.005375 -9.301921 4.3995037 2
-97.89892 -0.38156664 6.4163384 7.234347 1
-81.96325 0.1821717 -1.2870358 4.703838 1
-78.41986 -6.766374 0.8001185 0.83444935 1
-100.68544 -4.5810957 1.6977689 1.8801615 1
-87.05412 -2.9231584 6.817379 5.4460077 1
-64.121056 -3.7892206 -0.283514 6.3084154 1
-94.504845 -0.9999217 3.2884297 6.881124 1
-61.951996 -8.960198 -1.5915259 5.6160254 1
-108.19452 13.909201 0.6966458 -1.956591 0
-97.4037 22.897585 -2.8488266 1.4105041 0
-92.641335 22.10624 -3.5110545 2.467166 0
-199.18787 3.3090565 -2.5994794 4.0802555 0
-137.5976 6.795896 1.6793671 2.2256763 0
-208.0035 -1.33229 -3.2078092 1.5177402 0
-108.225975 14.341716 1.02891 -1.8651972 0
-121.29299 18.274035 2.2891548 2.3360753 0
I want to sort the rows based on a custom order of the values in the "Labels" column.
I am able to sort in ascending order, so that the labels appear as [0 1 2], via the command
df2 = df1.sort_values(by='Labels', ascending=True)
and with ascending=False the labels appear as [2 1 0].
How do I go about sorting the labels as [1 0 2]?
Any help will be greatly appreciated!
Here's a way using Categorical:
df['Labels'] = pd.Categorical(df['Labels'],
                              categories=[1, 0, 2],
                              ordered=True)
df.sort_values('Labels')
Output:
Feat1 Feat2 Feat3 Feat4 Labels
11 -100.685440 -4.581096 1.697769 1.880162 1
15 -61.951996 -8.960198 -1.591526 5.616025 1
8 -97.898920 -0.381567 6.416338 7.234347 1
9 -81.963250 0.182172 -1.287036 4.703838 1
10 -78.419860 -6.766374 0.800118 0.834449 1
14 -94.504845 -0.999922 3.288430 6.881124 1
12 -87.054120 -2.923158 6.817379 5.446008 1
13 -64.121056 -3.789221 -0.283514 6.308415 1
21 -208.003500 -1.332290 -3.207809 1.517740 0
20 -137.597600 6.795896 1.679367 2.225676 0
19 -199.187870 3.309057 -2.599479 4.080255 0
18 -92.641335 22.106240 -3.511055 2.467166 0
17 -97.403700 22.897585 -2.848827 1.410504 0
16 -108.194520 13.909201 0.696646 -1.956591 0
23 -121.292990 18.274035 2.289155 2.336075 0
22 -108.225975 14.341716 1.028910 -1.865197 0
7 -24.469942 27.005375 -9.301921 4.399504 2
6 -32.305700 26.568039 -9.470180 3.453279 2
5 -42.884900 28.038020 -7.857254 3.336100 2
4 -37.075096 22.347767 -3.860426 5.695394 2
3 -35.296390 21.709473 -4.160352 5.578346 2
2 -42.092365 25.680704 -5.009290 5.665794 2
1 -23.806690 20.536781 -5.015675 4.221635 2
0 -46.220314 22.862856 -6.157307 5.606041 2
You can use an ordered Categorical or, if you don't want to change the DataFrame, the poor man's variant: a mapping Series used as the sort key:
order = [1, 0, 2]
key = pd.Series({k:v for v,k in enumerate(order)}).get
# or
# pd.Series(range(len(order)), index=order).get
df1.sort_values(by='Labels', key=key)
Example:
df1 = pd.DataFrame({'Labels': [1,0,1,2,0,2,1]})
order = [1, 0, 2]
key = pd.Series({k:v for v,k in enumerate(order)}).get
print(df1.sort_values(by='Labels', key=key))
Labels
0 1
2 1
6 1
1 0
4 0
3 2
5 2
Here is another way to do it:
create a new column using map to encode the new order sequence, then sort as usual:
df['sort_label'] = df['Labels'].map({1: 0, 0: 1, 2: 2})
df.sort_values('sort_label')
Feat1 Feat2 Feat3 Feat4 Labels sort_label
11 -100.685440 -4.581096 1.697769 1.880162 1 0
15 -61.951996 -8.960198 -1.591526 5.616025 1 0
8 -97.898920 -0.381567 6.416338 7.234347 1 0
9 -81.963250 0.182172 -1.287036 4.703838 1 0
10 -78.419860 -6.766374 0.800119 0.834449 1 0
14 -94.504845 -0.999922 3.288430 6.881124 1 0
12 -87.054120 -2.923158 6.817379 5.446008 1 0
13 -64.121056 -3.789221 -0.283514 6.308415 1 0
21 -208.003500 -1.332290 -3.207809 1.517740 0 1
20 -137.597600 6.795896 1.679367 2.225676 0 1
19 -199.187870 3.309057 -2.599479 4.080255 0 1
18 -92.641335 22.106240 -3.511054 2.467166 0 1
17 -97.403700 22.897585 -2.848827 1.410504 0 1
16 -108.194520 13.909201 0.696646 -1.956591 0 1
23 -121.292990 18.274035 2.289155 2.336075 0 1
22 -108.225975 14.341716 1.028910 -1.865197 0 1
7 -24.469942 27.005375 -9.301921 4.399504 2 2
6 -32.305700 26.568039 -9.470180 3.453279 2 2
5 -42.884900 28.038020 -7.857254 3.336100 2 2
4 -37.075096 22.347767 -3.860426 5.695394 2 2
3 -35.296390 21.709473 -4.160352 5.578346 2 2
2 -42.092365 25.680704 -5.009290 5.665794 2 2
1 -23.806690 20.536781 -5.015675 4.221635 2 2
0 -46.220314 22.862856 -6.157307 5.606041 2 2
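A note on the design choice: since pandas 1.1, sort_values accepts a key callable, so the helper column can be skipped entirely. The same mapping applied inline (a one-line sketch, equivalent to the map approach above):
df.sort_values('Labels', key=lambda s: s.map({1: 0, 0: 1, 2: 2}))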

Count number of non zero columns in a given set of columns of a data frame - pandas

I have a df as shown below
df:
Id Jan20 Feb20 Mar20 Apr20 May20 Jun20 Jul20 Aug20 Sep20 Oct20 Nov20 Dec20 Amount
1 20 0 0 12 1 3 1 0 0 2 2 0 100
2 0 0 2 1 0 2 0 0 1 0 0 0 500
3 1 2 1 2 3 1 1 2 2 3 1 1 300
From the above I would like to calculate an Activeness value, which is the number of non-zero values among these month columns:
'Jan20', 'Feb20', 'Mar20', 'Apr20', 'May20', 'Jun20', 'Jul20',
'Aug20', 'Sep20', 'Oct20', 'Nov20', 'Dec20'
Expected Output:
Id Jan20 Feb20 Mar20 Apr20 May20 Jun20 Jul20 Aug20 Sep20 Oct20 Nov20 Dec20 Amount Activeness
1 20 0 0 12 1 3 1 0 0 2 2 0 100 7
2 0 0 2 1 0 2 0 0 1 0 0 0 500 4
3 1 2 1 2 3 1 1 2 2 3 1 1 300 12
I tried the code below:
df['Activeness'] = pd.Series(index=df.index,
                             data=np.count_nonzero(df[['Jan20', 'Feb20', 'Mar20', 'Apr20',
                                                       'May20', 'Jun20', 'Jul20', 'Aug20',
                                                       'Sep20', 'Oct20', 'Nov20', 'Dec20']], axis=1))
It works well, but I would like to know whether there is any faster method.
You can try:
df['Activeness'] = df.filter(like='20').ne(0).sum(axis=1)
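How this works: .ne(0) produces a boolean frame and the row-wise sum counts the True values. A self-contained sketch on the question's data (it assumes the month columns are the only ones whose names contain '20', which is why filter(like='20') selects exactly them):
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3],
                   'Jan20': [20, 0, 1], 'Feb20': [0, 0, 2], 'Mar20': [0, 2, 1],
                   'Apr20': [12, 1, 2], 'May20': [1, 0, 3], 'Jun20': [3, 2, 1],
                   'Jul20': [1, 0, 1], 'Aug20': [0, 0, 2], 'Sep20': [0, 1, 2],
                   'Oct20': [2, 0, 3], 'Nov20': [2, 0, 1], 'Dec20': [0, 0, 1],
                   'Amount': [100, 500, 300]})

# boolean frame of "is non-zero", summed across each row
df['Activeness'] = df.filter(like='20').ne(0).sum(axis=1)
print(df[['Id', 'Activeness']])   # 7, 4, 12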

Calculate column by condition

I have this table in df:
X1 X2
1 1
1 2
2 2
2 2
3 3
3 3
And I want to calculate Y, where Y = Y_previous + 1 if X1 = X1_previous and X2 = X2_previous, else 0. Y on the first line = 0. Example:
X1 X2 Y
1 1 0
2 2 0
2 2 1
2 2 2
2 2 3
3 3 0
Not a duplicate: the earlier question was simpler (adding a value from a specific line); here the previous term feeds back into the calculation, so I need some cumulative calculation.
Here is a fuller example of what I need:
X1 X2 Y
1 1 0
2 2 0
2 2 1
2 2 2
2 2 3
3 3 0
3 3 1
2 2 0
What I get from the linked duplicate:
X1 X2 Y
1 1 0
2 2 0
2 2 1
2 2 2
2 2 3
3 3 0
3 3 1
2 2 4
Use GroupBy.cumcount, grouping by helper columns that identify runs of consecutive values:
df1 = df[['X1','X2']].ne(df[['X1','X2']].shift()).cumsum()
df['Y'] = df.groupby([df1['X1'], df1['X2']]).cumcount()
print (df)
X1 X2 Y
0 1 1 0
1 2 2 0
2 2 2 1
3 2 2 2
4 2 2 3
5 3 3 0
6 3 3 1
7 2 2 0
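To see why this works, print the helper frame df1 on its own: ne(shift()) flags every row where a value changes, and cumsum turns those flags into run identifiers, so each consecutive block of identical (X1, X2) pairs becomes its own group for cumcount. A minimal sketch on the second example's data:
import pandas as pd

df = pd.DataFrame({'X1': [1, 2, 2, 2, 2, 3, 3, 2],
                   'X2': [1, 2, 2, 2, 2, 3, 3, 2]})

# run identifiers: increase by 1 whenever a value differs from the row above
df1 = df[['X1', 'X2']].ne(df[['X1', 'X2']].shift()).cumsum()
print(df1['X1'].tolist())   # [1, 2, 2, 2, 2, 3, 3, 4] - the final 2 2 row is a new run

df['Y'] = df.groupby([df1['X1'], df1['X2']]).cumcount()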

Is there a way to make one integer increase when a second integer increases by a set amount in python 3?

I'm trying to create a simple game in python 3 and I'm trying to build in an EXP system, for example, every 50 experience points, your health (Which is already an integer) increases by one. Is there a command for this?
(I'm coding this on repl.it if that matters)
I've never shunned guessing. :)
Let me suppose that you are incrementing a variable called experience_points and that, once for every 50 increments, you want to increment a variable called health by one.
experience_points += 1
if experience_points % 50 == 0:
    health += 1
This bit of code shows how this might work. Notice how health goes up by one for every 50 times that experience_points goes up by one.
Welcome to the modulus operator!
>>> experience_points = 0
>>> health = 0
>>> while True:
...     # do something in the game
...     experience_points += 1
...     if experience_points % 50 == 0:
...         health += 1
...     print(experience_points, health, '<--', end='')
...     if experience_points > 160:
...         break
...
1 0 <--2 0 <--3 0 <-- ... 49 0 <--50 1 <--51 1 <-- ... 99 1 <--100 2 <--101 2 <-- ... 149 2 <--150 3 <--151 3 <-- ... 160 3 <--161 3 <--
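An alternative worth considering (a sketch of my own, not required by the answer above): instead of checking on every increment, derive health from total XP with floor division, so the two values can never drift apart. BASE_HEALTH here is a hypothetical starting value:
BASE_HEALTH = 10   # hypothetical starting health

def health_for(experience_points):
    # one extra health point per full 50 XP earned
    return BASE_HEALTH + experience_points // 50

print(health_for(49))    # 10
print(health_for(50))    # 11
print(health_for(160))   # 13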

In Python Pandas using cumsum with groupby and reset of cumsum when value is 0

I'm rather new at Python.
I'm trying to compute a cumulative sum for each client to see the consecutive months of inactivity (flag: 1 or 0). The cumulative sum of the 1's therefore needs to be reset when we hit a 0, and also when we reach a new client. See the example below, where a is the client column and b holds the dates.
After some research, I found the questions 'Cumsum reset at NaN' and 'In Python Pandas using cumsum with groupby'. I assume that I kind of need to put them together.
Adapting the code from 'Cumsum reset at NaN' to reset at 0 works:
cumsum = v.cumsum().fillna(method='pad')
reset = -cumsum[v.isnull() != 0].diff().fillna(cumsum)
result = v.where(v.notnull(), reset).cumsum()
However, I don't succeed at adding a groupby. My count just goes on...
So, a dataset would be like this:
import pandas as pd
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
                   'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
                   'c' : [1,0,1,0,1,1,0,1,1,0,1,1,1,1]})
this should result in a dataframe with the columns a, b, c and d with
'd' : [1,0,1,0,1,2,0,1,2,0,1,2,3,4]
Please note that I have a very large dataset, so calculation time is really important.
Thank you for helping me
Use groupby.apply with cumsum after identifying runs of contiguous values within each group, then groupby.cumcount to count upward within each run, adding 1 so the count starts at 1.
Multiplying by the original column provides the AND logic: it cancels all the zeros and keeps only the positive rows.
df['d'] = df.groupby('a')['c'] \
            .apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
print(df['d'])
0 1
1 0
2 1
3 0
4 1
5 2
6 0
7 1
8 2
9 0
10 1
11 2
12 3
13 4
Name: d, dtype: int64
Another way would be to apply a function via series.expanding on the groupby object, which evaluates the series from the first index up to the current index.
reduce then applies a two-argument function cumulatively over each expanding window, reducing it to a single value.
from functools import reduce
df.groupby('a')['c'].expanding() \
  .apply(lambda i: reduce(lambda x, y: x + 1 if y == 1 else 0, i, 0))
a
1 0 1.0
1 0.0
2 1.0
3 0.0
4 1.0
5 2.0
6 0.0
2 7 1.0
8 2.0
9 0.0
10 1.0
11 2.0
12 3.0
13 4.0
Name: c, dtype: float64
Timings:
%%timeit
df.groupby('a')['c'].apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
100 loops, best of 3: 3.35 ms per loop
%%timeit
df.groupby('a')['c'].expanding().apply(lambda s: reduce(lambda x, y: x+1 if y==1 else 0, s, 0))
1000 loops, best of 3: 1.63 ms per loop
I think you need a custom function with groupby:
#change row with index 6 to 1 for better testing
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
                   'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
                   'c' : [1,0,1,0,1,1,1,1,1,0,1,1,1,1],
                   'd' : [1,0,1,0,1,2,3,1,2,0,1,2,3,4]})
print (df)
a b c d
0 1 0.066667 1 1
1 1 0.133333 0 0
2 1 0.200000 1 1
3 1 0.266667 0 0
4 1 0.333333 1 1
5 1 0.400000 1 2
6 1 0.066667 1 3
7 2 0.133333 1 1
8 2 0.200000 1 2
9 2 0.266667 0 0
10 2 0.333333 1 1
11 2 0.400000 1 2
12 2 0.466667 1 3
13 2 0.533333 1 4
def f(x):
    x.loc[x.c == 1, 'e'] = 1
    a = x.e.notnull()
    x.e = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int)
    return x
print (df.groupby('a').apply(f))
a b c d e
0 1 0.066667 1 1 1
1 1 0.133333 0 0 0
2 1 0.200000 1 1 1
3 1 0.266667 0 0 0
4 1 0.333333 1 1 1
5 1 0.400000 1 2 2
6 1 0.066667 1 3 3
7 2 0.133333 1 1 1
8 2 0.200000 1 2 2
9 2 0.266667 0 0 0
10 2 0.333333 1 1 1
11 2 0.400000 1 2 2
12 2 0.466667 1 3 3
13 2 0.533333 1 4 4
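Since the question stresses calculation time, here is one more vectorized sketch of my own (an alternative under the assumption that df is sorted so each client's rows are contiguous) that avoids groupby.apply entirely:
import numpy as np

c = df['c'].to_numpy()
pos = np.arange(len(c))
new_client = df['a'].ne(df['a'].shift()).to_numpy()

# position of the most recent reset: a zero row, or the virtual row just
# before a client's first row; the -len(c) sentinel never wins the running maximum
marker = np.where(c == 0, pos, np.where(new_client, pos - 1, -len(c)))
last_reset = np.maximum.accumulate(marker)

df['d'] = np.where(c == 0, 0, pos - last_reset)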
