In PySpark, generate the minimum value for a window partition based on two columns: a grouping variable and consecutive negative values

I created a Spark DataFrame (from pandas) whose column 'a' has a mix of positive and negative values:
import pandas as pd

df = pd.DataFrame({"b": ['A','A','A','A','B', 'B','B','C','C','D','D', 'D','D','D','D','D','D','D','D','D'],
"Sno": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
"a": [3,-4,2, -1, -3, -1,-7,-6, 1, 1, -1, 1,4,5,-3,2,3,4, -1, -2],
"pos_neg": ['false','true','false','true','true','true','true','true','false','false','true','false','false','false','true','false','false','false','true','true'],
"neg_val_count":[0,1,1,2,1,1,1,1,1,0,1,1,1,1,2,2,2,2,3,3]})
df2=spark.createDataFrame(df)
The column 'pos_neg' indicates whether the value in 'a' is negative ('true' if it is). 'neg_val_count' counts the negative runs within each value of 'b': the counter resets every time 'b' changes, and consecutive negative values are counted as one. Hence for variable 'B' (in column 'b') the counter is one even though there are three negative values.
I would like to generate a column holding the minimum value of 'a' for each run of consecutive 'true' rows within a value of 'b' (i.e. each block of negatives between two 'false' rows). For instance, for the first 'true' run in 'A' the value is -4 (it is surrounded by 'false' rows); for the second 'true' run in 'A' it is -1; for 'B' there are three consecutive 'true' rows, so the value is the least of them, -7. In short, consecutive negative values are treated as one group and the minimum is taken within that group. 'expected value' below shows the desired outcome:
b Sno a pos_neg neg_val_count expected value
0 A 1 3 false 0 3
1 A 2 -4 true 1 -4
2 A 3 2 false 1 2
3 A 4 -1 true 2 -1
4 B 5 -3 true 1 -7
5 B 6 -1 true 1 -7
6 B 7 -7 true 1 -7
7 C 8 -6 true 1 -6
8 C 9 1 false 1 1
9 D 10 1 false 0 1
10 D 11 -1 true 1 -1
11 D 12 1 false 1 1
12 D 13 4 false 1 4
13 D 14 5 false 1 5
14 D 15 -3 true 2 -3
15 D 16 2 false 2 2
16 D 17 3 false 2 3
17 D 18 4 false 2 4
18 D 19 -1 true 3 -2
19 D 20 -2 true 3 -2
I tried the following, but it is not working; any support in this regard would be great.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w3 = Window.partitionBy('b','pos_neg').rowsBetween(Window.unboundedPreceding, 0).orderBy('Sno')
df2.withColumn('new_col', F.min('a').over(w3))
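A hedged sketch of one possible approach (not an answer from the original thread; the helper columns change and grp are my own): flag where pos_neg changes within each 'b' (ordered by Sno), cumulatively sum those flags to build a run id, then take the minimum of 'a' over each (b, run) and keep it only for the 'true' rows.
# A sketch, assuming df2 from above and an active SparkSession.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_order = Window.partitionBy('b').orderBy('Sno')

# 'change' flags a switch between positive and negative runs; 'grp' is the run id.
df3 = (df2
       .withColumn('change',
                   (F.col('pos_neg') != F.lag('pos_neg', 1, 'false').over(w_order)).cast('int'))
       .withColumn('grp', F.sum('change').over(w_order)))

# Minimum of 'a' over each run of identical pos_neg values within 'b';
# keep it only for the negative ('true') rows, otherwise keep 'a' itself.
w_grp = Window.partitionBy('b', 'grp')
result = df3.withColumn('expected_value',
                        F.when(F.col('pos_neg') == 'true', F.min('a').over(w_grp))
                         .otherwise(F.col('a')))
result.orderBy('Sno').show()
Partitioning by ('b', 'pos_neg') alone, as in the attempt above, lumps every negative run of the same 'b' into one partition, which is why the minimum comes out wrong there.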

Related

Compare nth letter in one column to a single letter in another

I have a df as follows:
Policy Letter Password Lower Upper Count Lower_Minus_1 Upper_Minus_1
0 4-5 l rllllj 4 5 4 3 4
1 4-10 s ssskssphrlpscsxrfsr 4 10 8 3 9
2 14-18 p ppppppppppppppppppp 14 18 19 13 17
3 1-6 z zzlzvmqbzzclrz 1 6 6 0 5
4 4-5 j jhjjhxhjkxj 4 5 5 3 4
The Lower_Minus_1 value is meant to be used as an index into the password, to check whether the character at that position matches the letter in column 'Letter'.
This line works:
print(df['Password'].str[3] == df['Letter'])
However, it strictly returns True/False based on position 3 of 'Password' for every single row.
First five:
0 True
1 False
2 True
3 True
4 True
I don't want the third position for every row. I want the Lower_Minus_1 position for each row.
I have tried the following but both fail:
print(df['Password'].str[df['Letter']] == df['Letter'])
Returns False for every single row as proven by:
print((df['Password'].str[df['Letter']] == df['Letter']).sum())
Returns: 0
Then I tried this:
print(df.apply(lambda x: x['Password'].str[x['Lower_Minus_1']], axis=1) == df['Letter'])
This throws an error:
File "D:/AofC/2020_day2.py", line 56, in <lambda>
print(df.apply(lambda x: x['Password'].str[x['Lower_Minus_1']], axis=1) == df['Letter'])
AttributeError: 'str' object has no attribute 'str'
df.apply(lambda x:x['Letter']== x['Password'][x.Lower_Minus_1], axis=1)
0 True
1 False
2 True
3 True
4 True
dtype: bool
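The apply version works because axis=1 passes each row in turn, so x['Password'][x.Lower_Minus_1] indexes every password by that row's own position. As a hedged alternative sketch (not from the thread), the same pairing can be done without apply by zipping the two columns:
import pandas as pd

# Pair every password with its own Lower_Minus_1 position, pick that character,
# then compare the result with the 'Letter' column.
picked = pd.Series(
    [pw[i] for pw, i in zip(df['Password'], df['Lower_Minus_1'])],
    index=df.index,
)
print(picked == df['Letter'])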

Pandas dataframe: Count no of rows which meet a set of conditions across multiple columns [duplicate]

I have a dataframe(edata) as given below
Domestic Catsize Type Count
1 0 1 1
1 1 1 8
1 0 2 11
0 1 3 14
1 1 4 21
0 1 4 31
From this dataframe I want to calculate the sum of all counts where the logical AND of the two variables (Domestic and Catsize) results in zero (0), i.e. for these combinations:
Domestic Catsize AND
1        0       0
0        1       0
0        0       0
The code I use to perform the process is
g = edata.groupby('Type')
q3 = g.apply(lambda x: x[((x['Domestic'] == 0) & (x['Catsize'] == 0) |
                          (x['Domestic'] == 0) & (x['Catsize'] == 1) |
                          (x['Domestic'] == 1) & (x['Catsize'] == 0))]
                        ['Count'].sum())
q3
Type
1 1
2 11
3 14
4 31
This code works fine; however, as the number of variables in the dataframe increases, the number of conditions grows rapidly. So, is there a smart way to write a condition that says: if ANDing the two (or more) variables results in zero, then perform the sum()?
You can filter first using pd.DataFrame.all negated:
cols = ['Domestic', 'Catsize']
res = df[~df[cols].all(1)].groupby('Type')['Count'].sum()
print(res)
# Type
# 1 1
# 2 11
# 3 14
# 4 31
# Name: Count, dtype: int64
Use np.logical_and.reduce to generalise.
columns = ['Domestic', 'Catsize']
df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
Type
1 1
2 11
3 14
4 31
Name: Count, dtype: int64
Before adding it back, use map to broadcast:
u = df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
df['NewCol'] = df.Type.map(u)
df
Domestic Catsize Type Count NewCol
0 1 0 1 1 1
1 1 1 1 8 1
2 1 0 2 11 11
3 0 1 3 14 14
4 1 1 4 21 31
5 0 1 4 31 31
How about:
columns = ['Domestic', 'Catsize']
df.loc[~df[columns].prod(axis=1).astype(bool), 'Count']
and then do with it whatever you want.
For logical AND, the product does the trick nicely.
For logical OR, you can use sum(axis=1) with proper negation in advance; see the sketch below.
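A short sketch of one reading of that hint (mine, not from the thread), using the sample data from the question (stored as df, as in the answers above): a row's logical OR is zero only when every flag is zero, so either test that the plain sum is zero, or negate the columns first and use the product trick.
import pandas as pd

df = pd.DataFrame({'Domestic': [1, 1, 1, 0, 1, 0],
                   'Catsize':  [0, 1, 0, 1, 1, 1],
                   'Type':     [1, 1, 2, 3, 4, 4],
                   'Count':    [1, 8, 11, 14, 21, 31]})

columns = ['Domestic', 'Catsize']

# Rows where the logical OR of the flags is zero, i.e. every flag is 0.
or_zero = ~df[columns].sum(axis=1).astype(bool)
# Equivalent, negating in advance and taking the product (AND of negations).
or_zero_alt = (1 - df[columns]).prod(axis=1).astype(bool)

print(df.loc[or_zero, 'Count'].sum())   # 0 for this sample: no all-zero rows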

How to output the order where at least one column is filled with all positive number

I want to create a formula that outputs the orders where at least one column is filled with all positive values. In this case, the result should be orders 11 and 12. Thank you.
order a b c
11 1 1 2
11 1 -1 -3
12 -2 1 -1
12 1 1 3
13 2 3 2
13 -1 -2 -3
Try the formula below:
=IFERROR(INDEX($A$2:$A$7,AGGREGATE(15,6,ROW($1:$6)/(($B$2:$B$7>0)*($C$2:$C$7>0)*($D$2:$D$7>0)),ROW(1:1)),COLUMN(A$1)),"")

Skipping every nth row in pandas

I am trying to slice my dataframe by skipping every 4th row. The best way I could get it done is by getting the index of every 4th row and then selecting all the other rows, like below:
df[~df.index.isin(df[::4].index)]
I was wondering if there is a simpler and/or more pythonic way of getting this done.
One possible solution is to create a mask with modulo and filter by boolean indexing:
df = pd.DataFrame({'a':range(10, 30)}, index=range(20))
#print (df)
b = df[np.mod(np.arange(df.index.size),4)!=0]
print (b)
a
1 11
2 12
3 13
5 15
6 16
7 17
9 19
10 20
11 21
13 23
14 24
15 25
17 27
18 28
19 29
Details:
print (np.mod(np.arange(df.index.size),4))
[0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3]
print (np.mod(np.arange(df.index.size),4)!=0)
[False True True True False True True True False True True True
False True True True False True True True]
If the index values are unique, use a slightly changed version of #jpp's solution from the comments:
b = df.drop(df.index[::4], axis=0)
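As a further hedged sketch (not from the thread), purely positional selection with iloc sidesteps index labels entirely, so it works whether or not the index values are unique:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(10, 30)}, index=range(20))

# Keep every row whose position is not a multiple of 4.
b = df.iloc[np.arange(len(df)) % 4 != 0]
print(b.head())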

Conditional cumulative sum in Python/Pandas

Consider my dataframe, df:
data data_binary sum_data
2 1 1
5 0 0
1 1 1
4 1 2
3 1 3
10 0 0
7 0 0
3 1 1
How can I calculate the cumulative sum of data_binary within groups of contiguous 1 values?
The first group of 1's has a single 1, so sum_data is just 1. However, the second group of 1's has three 1's, so sum_data is [1, 2, 3].
I've tried using np.where(df['data_binary'] == 1, df['data_binary'].cumsum(), 0), but that returns
array([1, 0, 2, 3, 4, 0, 0, 5])
Which is not what I want.
You want to take the cumulative sum of data_binary and subtract the most recent cumulative sum where data_binary was zero.
b = df.data_binary
c = b.cumsum()
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
Output
0 1
1 0
2 1
3 2
4 3
5 0
6 0
7 1
Name: data_binary, dtype: int64
Explanation
Let's start by looking at each step side by side
cols = ['data_binary', 'cumulative_sum', 'nan_non_zero', 'forward_fill', 'final_result']
print(pd.concat([
b, c,
c.mask(b != 0),
c.mask(b != 0).ffill(),
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
], axis=1, keys=cols))
Output
data_binary cumulative_sum nan_non_zero forward_fill final_result
0 1 1 NaN NaN 1
1 0 1 1.0 1.0 0
2 1 2 NaN 1.0 1
3 1 3 NaN 1.0 2
4 1 4 NaN 1.0 3
5 0 4 4.0 4.0 0
6 0 4 4.0 4.0 0
7 1 5 NaN 4.0 1
The problem with cumulative_sum is that the rows where data_binary is zero do not reset the sum. That is the motivation for this solution. How do we "reset" the sum when data_binary is zero? Easy! I slice the cumulative sum where data_binary is zero and forward fill the values. When I take the difference between this and the cumulative sum, I've effectively reset the sum.
You can use groupby with DataFrameGroupBy.cumsum, grouping by a helper Series: first compare each value with the shifted column for inequality (!=), then build group ids with cumsum. Finally, replace values with 0 where data_binary is 0, using mask:
print (df.data_binary.ne(df.data_binary.shift()).cumsum())
0 1
1 2
2 3
3 3
4 3
5 4
6 4
7 5
Name: data_binary, dtype: int32
df['sum_data1'] = (df.data_binary
                   .groupby(df.data_binary.ne(df.data_binary.shift()).cumsum())
                   .cumsum())
df['sum_data1'] = df['sum_data1'].mask(df.data_binary == 0, 0)
print (df)
data data_binary sum_data sum_data1
0 2 1 1 1
1 5 0 0 0
2 1 1 1 1
3 4 1 2 2
4 3 1 3 3
5 10 0 0 0
6 7 0 0 0
7 3 1 1 1
If you want piRSquared's excellent answer in just one single command:
df['sum_data'] = df[['data_binary']].apply(
lambda x: x.cumsum().sub(x.cumsum().mask(x != 0).ffill(), fill_value=0).astype(int),
axis=0)
Note that the double square brackets on the right-hand side are necessary to make a one-column DataFrame instead of a Series, so that apply can be used with the axis argument (which is not available when apply is called on a Series).