pandas: assign column values depending on another column in the df

I have the following df,
id a_id b_id
1 25 50
1 25 50
2 26 51
2 26 51
3 25 52
3 28 52
3 28 52
I have the following code that sets a_id and b_id values to -1 based on which rows each value covers for a given id: if an a_id or b_id value occurs on exactly the same rows/sub-df as a specific id value does, those a_id or b_id entries are set to -1:
cluster_ids = df.loc[df['id'] > -1]['id'].unique()
types = ['a_id', 'b_id']
for cluster_id in cluster_ids:
    rows = df.loc[df['id'] == cluster_id]
    for type in types:
        ids = rows[type].values
        match_rows = df.loc[df[type] == ids[0]]
        if match_rows.equals(rows):
            df.loc[match_rows.index, type] = -1
so the result df will look like,
id a_id b_id
1 25 -1
1 25 -1
2 -1 -1
2 -1 -1
3 25 -1
3 28 -1
3 28 -1
I am wondering if there is a more efficient way to do it.

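You can build two boolean masks, one for each direction of the relationship, and combine them: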
one_value_for_each_id = df.groupby('id').transform(lambda x: len(set(x)) == 1)
a_id b_id
0 True True
1 True True
2 True True
3 True True
4 False True
5 False True
6 False True
one_id_for_each_value = pd.DataFrame({
    col: df.groupby(col).id.transform(lambda x: len(set(x)) == 1)
    for col in ['a_id', 'b_id']
})
a_id b_id
0 False True
1 False True
2 True True
3 True True
4 False True
5 True True
6 True True
one_to_one_relationship = one_id_for_each_value & one_value_for_each_id
# Set all values that satisfy the one-to-one relationship to `-1`
df.loc[one_to_one_relationship.a_id, 'a_id'] = -1
df.loc[one_to_one_relationship.b_id, 'b_id'] = -1
a_id b_id
0 25 -1
1 25 -1
2 -1 -1
3 -1 -1
4 25 -1
5 28 -1
6 28 -1
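
Putting the pieces together, a minimal self-contained sketch (the frame is rebuilt from the question's example so it runs end to end):

import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 2, 2, 3, 3, 3],
    'a_id': [25, 25, 26, 26, 25, 28, 28],
    'b_id': [50, 50, 51, 51, 52, 52, 52],
})

# For each id, is there exactly one distinct value in the column?
one_value_for_each_id = df.groupby('id').transform(lambda x: len(set(x)) == 1)

# For each value of the column, is there exactly one distinct id?
one_id_for_each_value = pd.DataFrame({
    col: df.groupby(col).id.transform(lambda x: len(set(x)) == 1)
    for col in ['a_id', 'b_id']
})

one_to_one_relationship = one_id_for_each_value & one_value_for_each_id
df.loc[one_to_one_relationship.a_id, 'a_id'] = -1
df.loc[one_to_one_relationship.b_id, 'b_id'] = -1
print(df)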

Related

Checking for specific value change between columns in pandas

I've got 4 columns with numeric values between 1 and 4, and I'm trying to see which rows change from a value of 1 to a value of 4 progressing from column a to column d within those 4 columns. Currently I'm pulling the difference between each of the columns and looking for a value of 3. Is there a better way to do this?
Here's what I'm looking for (with 0's in place of nan):
ID a b c d check
1 1 0 1 4 True
2 1 0 1 1 False
3 1 1 1 4 True
4 1 3 3 4 True
5 0 0 1 4 True
6 1 2 3 3 False
7 1 0 0 4 True
8 1 4 4 4 True
9 1 4 3 4 True
10 1 4 1 1 True
You can just use cummax:
col = ['a', 'b', 'c', 'd']
s = df[col].cummax(axis=1)
df['new'] = s[col[:3]].eq(1).any(axis=1) & s[col[-1]].eq(4)
Out[523]:
0 True
1 False
2 True
3 True
4 True
5 False
6 True
7 True
8 True
9     True
dtype: bool
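To see what the running maximum buys here, a minimal sketch on two rows borrowed from the question's data: once a value greater than 1 appears, the running max stops being 1, so eq(1) on columns a..c says "a 1 came before anything larger", while eq(4) on d says "a 4 was eventually reached".

import pandas as pd

df = pd.DataFrame({'a': [1, 1], 'b': [0, 2], 'c': [1, 3], 'd': [4, 3]})
s = df[['a', 'b', 'c', 'd']].cummax(axis=1)
print(s)
#    a  b  c  d
# 0  1  1  1  4   <- running max still 1 before the 4: True
# 1  1  2  3  3   <- running max never reaches 4: False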
You can try comparing the index of 4 and the index of 1 in apply:
cols = ['a', 'b', 'c', 'd']

def get_index(lst, num):
    return lst.index(num) if num in lst else -1

df['Check'] = df[cols].apply(lambda row: get_index(row.tolist(), 4) > get_index(row.tolist(), 1), axis=1)
print(df)
ID a b c d check Check
0 1 1 0 1 4 True True
1 2 1 0 1 1 False False
2 3 1 1 1 4 True True
3 4 1 3 3 4 True True
4 5 0 0 1 4 True True
5 6 1 2 3 3 False False
6 7 1 0 0 4 True True
7 8 1 4 4 4 True True
8 9 1 4 3 4 True True
9 10 1 4 1 1 True True
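Note that apply with axis=1 calls the Python function once per row, so on large frames the vectorized cummax answer above should be noticeably faster; the index-comparison version is easier to adapt to other value pairs, though.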

How to add a number to a group of rows in a column only when the rows are grouped and have the same value?

I have a dataframe with multiple columns. One of these columns consists of 0/1 values. For example:
df = pd.DataFrame({'A': [0,0,0,0,1,1,1,0,0,0,0,0,1,1,0,0,0,0,1,1,1,1,0,0]})
What I need to do is identify every group of 1s and give each successive group an incremented number, leaving the first group of 1s as it is.
The output should be a dataframe as follows:
0,0,0,0,1,1,1,0,0,0,0,0,2,2,0,0,0,0,3,3,3,3,0,0
Is there a way to do this without it being messy and complicated?
Use a boolean mask:
# Look for current row = 1 and previous row = 0
m = df['A'].diff().eq(1)
df['G'] = m.cumsum().mask(df['A'].eq(0), 0)
print(df)
# Output
A G # m
0 0 0 # False
1 0 0 # False
2 0 0 # False
3 0 0 # False
4 1 1 # True <- Group 1
5 1 1 # False
6 1 1 # False
7 0 0 # False
8 0 0 # False
9 0 0 # False
10 0 0 # False
11 0 0 # False
12 1 2 # True <- Group 2
13 1 2 # False
14 0 0 # False
15 0 0 # False
16 0 0 # False
17 0 0 # False
18 1 3 # True <- Group 3
19 1 3 # False
20 1 3 # False
21 1 3 # False
22 0 0 # False
23 0 0 # False
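
One caveat worth knowing: diff yields NaN on the first row, so a series that starts with 1 would have its first group missed. A variant sketch that sidesteps this by comparing against a shifted copy (same grouping logic otherwise):

# Treat the virtual row before the first one as 0
m = df['A'].gt(df['A'].shift(fill_value=0))
df['G'] = m.cumsum().mask(df['A'].eq(0), 0)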

pandas: create a column based on values in another column selected as conditions

I have the following df,
id match_type amount negative_amount
1 exact 10 False
1 exact 20 False
1 name 30 False
1 name 40 False
1 amount 15 True
1 amount 15 True
2 exact 0 False
2 exact 0 False
I want to create a boolean column 0_amount_sum that indicates whether the amount sum is <= 0 for each id and match_type combination; the following is the result df:
id match_type amount 0_amount_sum negative_amount
1 exact 10 False False
1 exact 20 False False
1 name 30 False False
1 name 40 False False
1 amount 15 True True
1 amount 15 True True
2 exact 0 True False
2 exact 0 True False
For id=1 and match_type=exact, the amount sum is 30, so 0_amount_sum is False. My current code is as follows:
df = df.loc[df.match_type == 'exact']
df['0_amount_sum_'] = (df.assign(
    amount_n=df.amount * np.where(df.negative_amount, -1, 1)).groupby(
    'id')['amount_n'].transform(lambda x: sum(x) <= 0))
df = df.loc[df.match_type == 'name']
df['0_amount_sum_'] = (df.assign(
    amount_n=df.amount * np.where(df.negative_amount, -1, 1)).groupby(
    'id')['amount_n'].transform(lambda x: sum(x) <= 0))
df = df.loc[df.match_type == 'amount']
df['0_amount_sum_'] = (df.assign(
    amount_n=df.amount * np.where(df.negative_amount, -1, 1)).groupby(
    'id')['amount_n'].transform(lambda x: sum(x) <= 0))
I am wondering if there is a better/more efficient way to do this, especially when the values of match_type are unknown in advance, so that the code automatically covers all possible values.
I believe you need to group by two Series (columns) instead of filtering:
df['0_amount_sum_'] = ((df.amount * np.where(df.negative_amount, -1, 1))
                       .groupby([df['id'], df['match_type']])
                       .transform('sum')
                       .le(0))
id match_type amount negative_amount 0_amount_sum_
0 1 exact 10 False False
1 1 exact 20 False False
2 1 name 30 False False
3 1 name 40 False False
4 1 amount 15 True True
5 1 amount 15 True True
6 2 exact 0 False True
7 2 exact 0 False True
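
As a self-contained sketch (frame rebuilt from the question), with the signed amount pulled out into its own variable for readability:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':              [1, 1, 1, 1, 1, 1, 2, 2],
    'match_type':      ['exact', 'exact', 'name', 'name',
                        'amount', 'amount', 'exact', 'exact'],
    'amount':          [10, 20, 30, 40, 15, 15, 0, 0],
    'negative_amount': [False, False, False, False, True, True, False, False],
})

# Flip the sign where negative_amount is True, then sum per (id, match_type)
signed = df['amount'] * np.where(df['negative_amount'], -1, 1)
df['0_amount_sum_'] = (signed
                       .groupby([df['id'], df['match_type']])
                       .transform('sum')
                       .le(0))
print(df)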

Skipping every nth row in pandas

I am trying to slice my dataframe by skipping every 4th row. The best way I could find is to get the index of every 4th row and then select all the other rows, like below:
df[~df.index.isin(df[::4].index)]
I was wondering if there is a simpler and/or more pythonic way of getting this done.
One possible solution is to create a mask with modulo arithmetic and filter by boolean indexing:
df = pd.DataFrame({'a':range(10, 30)}, index=range(20))
#print (df)
b = df[np.mod(np.arange(df.index.size),4)!=0]
print (b)
a
1 11
2 12
3 13
5 15
6 16
7 17
9 19
10 20
11 21
13 23
14 24
15 25
17 27
18 28
19 29
Details:
print (np.mod(np.arange(df.index.size),4))
[0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3]
print (np.mod(np.arange(df.index.size),4)!=0)
[False True True True False True True True False True True True
False True True True False True True True]
If the index values are unique, you can use a slightly changed version of #jpp's solution from the comments:
b = df.drop(df.index[::4])
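
Both variants generalize to any step size; a small sketch with the step as a parameter (the helper name is just for illustration):

import numpy as np
import pandas as pd

def skip_every_nth(df, n):
    # Keep rows whose position is not a multiple of n
    return df[np.arange(len(df)) % n != 0]

df = pd.DataFrame({'a': range(10, 30)})
print(skip_every_nth(df, 4).head())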

How to delete the entire row if any of its values is 0 in pandas

In the example below I only want to retain rows 1 and 2; I want to delete all rows that have a 0 anywhere across their columns:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
3 3 3 0 3 3
4 0 4 0 0 0
5 5 5 5 5 0
the output should read like below:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
I have tried:
df.loc[(df!=0).any(axis=1)]
But it deletes a row only if all of its columns are 0.
You are really close; you need DataFrame.all to check that all values per row are True:
df = df.loc[(df!=0).all(axis=1)]
print (df)
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
Details:
print (df!=0)
kt b tt mky depth
1 True True True True True
2 True True True True True
3 True True False True True
4 False True False False False
5 True True True True False
print ((df!=0).all(axis=1))
1 True
2 True
3 False
4 False
5 False
dtype: bool
An alternative solution uses any to check for at least one True per row on the changed mask df == 0, then inverts it with ~:
df = df.loc[~(df==0).any(axis=1)]
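
For completeness, the same filter as a runnable sketch, using the method-call spelling df.ne(0) (equivalent to df != 0):

import pandas as pd

df = pd.DataFrame({
    'kt':    [1, 2, 3, 0, 5],
    'b':     [1, 2, 3, 4, 5],
    'tt':    [1, 2, 0, 0, 5],
    'mky':   [1, 2, 3, 0, 5],
    'depth': [4, 2, 3, 0, 0],
}, index=[1, 2, 3, 4, 5])

print(df[df.ne(0).all(axis=1)])
#    kt  b  tt  mky  depth
# 1   1  1   1    1      4
# 2   2  2   2    2      2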
