Compare nth letter in one column to a single letter in another - python-3.x

I have a df as follows:
Policy Letter Password Lower Upper Count Lower_Minus_1 Upper_Minus_1
0 4-5 l rllllj 4 5 4 3 4
1 4-10 s ssskssphrlpscsxrfsr 4 10 8 3 9
2 14-18 p ppppppppppppppppppp 14 18 19 13 17
3 1-6 z zzlzvmqbzzclrz 1 6 6 0 5
4 4-5 j jhjjhxhjkxj 4 5 5 3 4
Lower_Minus_1 value is to be used as an index to search that position in the password to see if it matches the letter in column 'Letter'.
This line works:
print(df['Password'].str[3] == df['Letter'])
However, it strictly returns True\False based upon the third position for the value in 'Password' for every single row.
First five:
0 True
1 False
2 True
3 True
4 True
I don't want the third position for every row. I want the Lower_Minus_1 position for each row.
I have tried the following but both fail:
print(df['Password'].str[df['Letter']] == df['Letter'])
Returns False for every single row as proven by:
print((df['Password'].str[df['Letter']] == df['Letter']).sum())
Returns: 0
Then I tried this:
print(df.apply(lambda x: x['Password'].str[x['Lower_Minus_1']], axis=1) == df['Letter'])
This throws an error:
File "D:/AofC/2020_day2.py", line 56, in <lambda>
print(df.apply(lambda x: x['Password'].str[x['Lower_Minus_1']], axis=1) == df['Letter'])
AttributeError: 'str' object has no attribute 'str'

df.apply(lambda x:x['Letter']== x['Password'][x.Lower_Minus_1], axis=1)
0 True
1 False
2 True
3 True
4 True
dtype: bool

Related

How can I delete useless strings by index from a Pandas DataFrame defining a function?

I have a DataFrame, namely 'traj', as follow:
x y z
0 5 3 4
1 4 2 8
2 1 1 7
3 Some string here
4 This is spam
5 5 7 8
6 9 9 7
... #continues repeatedly a lot with the same strings here in index 3 and 4
79 4 3 3
80 Some string here
I'm defining a function in order to delete useless strings positioned in certain index from the DataFrame. Here is what I'm trying:
def spam(names,df): #names is a list composed, for instance, by "Some" and "This" in 'traj'
return df.drop(index = ([traj[(traj.iloc[:,0] == n)].index for n in names]))
But when I call it it returns the error:
traj_clean = spam(my_list_of_names, traj)
...
KeyError: '[(3,4,...80)] not found in axis'
If I try alone:
traj.drop(index = ([traj[(traj.iloc[:,0] == 'Some')].index for n in names]))
it works.
I solved it in a different way:
df = traj[~traj[:].isin(names)].dropna()
Where names is a list of the terms you wish to delete.
df will contain only rows without these terms

Pandas remove group if difference between first and last row in group exceeds value

I have a dataframe df:
df = pd.DataFrame({})
df['X'] = [3,8,11,6,7,8]
df['name'] = [1,1,1,2,2,2]
X name
0 3 1
1 8 1
2 11 1
3 6 2
4 7 2
5 8 2
For each group within 'name' and want to remove that group if the difference between the first and last row of that group is smaller than a specified value d_dif in absolute way:
For example, when d_dif= 5, I want to get:
X name
0 3 1
1 8 1
2 11 1
If your data is increasingly in X, you can use groupby().transform() and np.ptp
threshold = 5
ranges = df.groupby('name')['X'].transform(np.ptp)
df[ranges > threshold]
If you only care about first and last, then transform just first and last:
threshold = 5
groups = df.groupby('name')['X']
ranges = groups.transform('last') - groups.transform('first')
df[ranges.abs() > threshold]

adding 1 to the previous row based on conditions

I have a pandas dataframe like below:
data=[['A',1,30],
['A',1,2],
['A',0,4],
['A',1,4],
['B',0,5],
['B',1,1],
['B',0,5],
['B',1,8]]
df = pd.DataFrame(data,columns=['group','var_1','var_2'])
I want to create a series of values with index based on below condition:
Step 1) Increment should always happen from 1st row of 'var_2'of each group. For example: for group A, the increment should start from 30 and for group B,
increment should start from 5
Step 2) Incremented value where 'var_1" = 1
My desired output:
0 30
1 31
3 32
5 6
7 7
IIUC:
#Get first index in each group and union index where var_1 ==1
indx = df.drop_duplicates('group').index.union(df[(df['var_1']==1)].index)
#Reindex dataframe group by group, add cusum value to other present values in group.
#Use .loc to filter where var_1 != 0 and get column var_2
df.reindex(indx).groupby('group')\
.transform(lambda x: x.iloc[0] + x.shift().notna().cumsum())\
.loc[lambda x: x.var_1 !=0, 'var_2']
Output:
0 30
1 31
3 32
5 6
7 7
Name: var_2, dtype: int64
Try groupby cumcount and first
df1 = df.loc[df.var_1.eq(1)]
g = df1.groupby('group')['var_2']
g.transform('first') + g.cumcount()
Out[66]:
0 30
1 31
3 32
5 1
7 2
dtype: int64
Or use duplicated with df.where and cumsum
df1 = df.loc[df.var_1.eq(1)]
df1.var_2.where(~df1.duplicated('group'), 1).groupby(df1.group).cumsum()
Out[77]:
0 30
1 31
3 32
5 1
7 2
Name: var_2, dtype: int64

Get row with a symbol after a particular index

I have a df:
Index col1
1 Abc
2 xyz
3 $123
4 wer
5 exr
6 ert
7 $546
8 $456
Problem Statement:
Now I want to find the index of the row containing the dollar sign after the keyword wer.
My Code:
idx = df.col1.str.contains('\$').idxmax() ## this gives me index 3 but what i want is index 7
Help need to modify my code to get the desired output
You need to mask the wer as well
s = (df['col1'].str.contains('\$') # rows containing $
& df['col1'].eq('wer').cumsum().gt(0) # rows after the first 'wer'
).idxmax()
# s == 7
Use:
#df=df.set_index('Index') #if 'index' is a column
df2=df[df['col1'].eq('wer').cumsum()>0]
df2['col1'].str.contains('\$').idxmax()
or:
df[(df['col1'].eq('wer').cumsum()>0) & df['col1'].str.contains('\$')].index[0]
Output:
7
Details:
df['col1'].eq('wer').cumsum().eq(1)
Index
1 False
2 False
3 False
4 True
5 True
6 True
7 True
8 True
Name: col1, dtype: bool
print(df2)
col1
Index
4 wer
5 exr
6 ert
7 $546
8 $456

In pyspark generate minimum value for a window partition based on two column value, variable and consecutive negative values

Created a rdd, having column 'a' which has a mix of positive and negative values
df = pd.DataFrame({"b": ['A','A','A','A','B', 'B','B','C','C','D','D', 'D','D','D','D','D','D','D','D','D'],
"Sno": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
"a": [3,-4,2, -1, -3, -1,-7,-6, 1, 1, -1, 1,4,5,-3,2,3,4, -1, -2],
"pos_neg": ['false','true','false','true','true','true','true','true','false','false','true','false','false','false','true','false','false','false','true','true'],
"neg_val_count":[0,1,1,2,1,1,1,1,1,0,1,1,1,1,2,2,2,2,3,3]})
df2=spark.createDataFrame(df)
the column 'pos_neg' represents if fields in 'a' is positive or negative, if negative it is true. 'neg_val_count' is counter for negative values within the fields for variable 'b'. Every time the variable 'b'changes the counter resets and consecutive negative values are taken as single. Hence for variable 'B' (in column 'b') counter is one even though there are three negative values.
I would like to generate a column which will have the minimum value for a combination of variable in 'b' (say A) and value in 'a' (for true cases between two false). for instance for first combination of 'A' and true, the value will be -4 (it is surrounded by false),for second combination of 'A' and true value will be -1, for B there are three consecutive true so the value will be least of them as -7. Basically consecutive negative values are taken as one and minimum is taking out of them. Expected value refers to the outcome required
b Sno a pos_neg neg_val_count expected value
0 A 1 3 false 0 3
1 A 2 -4 true 1 -4
2 A 3 2 false 1 2
3 A 4 -1 true 2 -1
4 B 5 -3 true 1 -7
5 B 6 -1 true 1 -7
6 B 7 -7 true 1 -7
7 C 8 -6 true 1 -6
8 C 9 1 false 1 1
9 D 10 1 false 0 1
10 D 11 -1 true 1 -1
11 D 12 1 false 1 1
12 D 13 4 false 1 4
13 D 14 5 false 1 5
14 D 15 -3 true 2 -3
15 D 16 2 false 2 2
16 D 17 3 false 2 3
17 D 18 4 false 2 4
18 D 19 -1 true 3 -2
19 D 20 -2 true 3 -2
I Tried using the following but it is not working, any support in this regard will be great.
w3 = Window.partitionBy('b','pos_neg').rowsBetween(Window.unboundedPreceding, 0).orderBy('Sno')
df2.withColumn('new_col', F.min('a').over(w3))

Resources