pandas DataFrame, assign value based on selection of other rows

I have a pandas DataFrame in python 3.
In this DataFrame there are rows which have identical values in two columns (this can be whole sections), I'll call this a group.
Each row also has a True/False value in a column.
Now, for each row I want to know whether any of the rows in its group has a False value; if so, I want to assign False to every row in that group in another column. I've managed to do this with a for-loop, but it's quite slow:
import pandas as pd
import numpy as np

df = pd.DataFrame({'E': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                   'D': [0, 1, 2, 3, 4, 5, 6],
                   'C': [True, True, False, False, True, True, True],
                   'B': ['aa', 'aa', 'aa', 'bb', 'cc', 'dd', 'dd'],
                   'A': [0, 0, 0, 0, 1, 1, 1]})
Which gives:
df:
A B C D E
0 0 aa True 0 NaN
1 0 aa True 1 NaN
2 0 aa False 2 NaN
3 0 bb False 3 NaN
4 1 cc True 4 NaN
5 1 dd True 5 NaN
6 1 dd True 6 NaN
Now I run the for-loop:
for i in df.index:
    df.ix[i, 'E'] = df[(df['A'] == df.iloc[i]['A']) & (df['B'] == df.iloc[i]['B'])]['C'].all()
which then gives the desired result:
df:
A B C D E
0 0 aa True 0 False
1 0 aa True 1 False
2 0 aa False 2 False
3 0 bb False 3 False
4 1 cc True 4 True
5 1 dd True 5 True
6 1 dd True 6 True
When running this for my entire DataFrame of ~1 million rows it takes ages. So, looking at using .apply() to avoid a for-loop, I stumbled across the following question: apply a function to a pandas Dataframe whose returned value is based on other rows
however:
def f(x): return False not in x
df.groupby(['A','B']).C.apply(f)
returns:
A  B
0  aa    False
   bb     True
1  cc     True
   dd     True
Does anyone know a better way or how to fix the last case?

You could try doing a SQL-style join using pd.merge.
Perform the same groupby you're already doing, but apply min() to it; since the minimum of booleans is False as soon as any value in the group is False, the group result is True only when every C is True. Then convert that to a DataFrame, rename the column to 'E', and merge it back into df.
df = pd.DataFrame({'D': [0, 1, 2, 3, 4, 5, 6],
                   'C': [True, True, False, False, True, True, True],
                   'B': ['aa', 'aa', 'aa', 'bb', 'cc', 'dd', 'dd'],
                   'A': [0, 0, 0, 0, 1, 1, 1]})
falses = pd.DataFrame(df.groupby(['A', 'B']).C.min() == True)
falses = falses.rename(columns={'C': 'E'})
df = df.merge(falses, left_on=['A', 'B'], right_index=True)
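Note that "False not in x" in the .apply attempt tests membership against the group's index, not its values, which is why the bb group comes back True even though it contains a False. As an alternative sketch (assuming the df defined in the question), a groupby transform evaluates "are all C True?" per group and aligns the result back onto every row, so no merge or rename is needed:
# Sketch: transform broadcasts the group-level all() result back to each row.
df['E'] = df.groupby(['A', 'B'])['C'].transform(lambda s: s.all())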

Related

For each row of one DataFrame, I need to count how many rows of the other DataFrame have matching values in the shared columns.

For example, I have two dataframes:
df = [{'id':1, 'A': 'False', 'B' : 'False' , 'C':'NA'},{'id':2, 'A': 'True', 'B' : 'False' , 'C':'NA'}]
df2 = [{'A':'True', 'B': 'False' , 'C':'NA', 'D': 'True'}]
The idea is to count, for each row of df, how many rows of df2 match it on the shared columns.
df:
   id      A      B   C
0   1  False  False  NA
1   2   True  False  NA
df2:
      A      B   C     D
0  True  False  NA  True
Output:
   id      A      B   C  Count
0   1  False  False  NA      0
1   2   True  False  NA      1
I tried something like:
for i in range(columns):
    x = action_value_counts_df.columns[i]
    if compare_column.equals(action_value_counts_df[x]):
        print(x, 'Matched')
    else:
        print(x, 'Not Matched')
This code did not help
1. Merge df and df2 on the overlapping columns.
2. Count the matching rows per key.
3. Merge the count result DataFrame back into df.
4. Replace NaN with 0 in the 'Count' column.
import pandas as pd

df = pd.DataFrame([
    {'id': 1, 'A': 'False', 'B': 'False', 'C': 'NA'},
    {'id': 2, 'A': 'True', 'B': 'False', 'C': 'NA'}])
df2 = pd.DataFrame([
    {'A': 'True', 'B': 'False', 'C': 'NA', 'D': 'True'}])
# 1
match_df = pd.merge(left=df, right=df2, on=['A', 'B', 'C'], how='inner')
match_df = match_df.assign(Count=1)
"""
id A B C D Count
0 2 True False NA True 1
"""
# 2
match_count_df = match_df.groupby(['A', 'B', 'C'], as_index=False).count()
match_count_df = match_count_df[['A', 'B', 'C', 'Count']]
"""
A B C Count
0 True False NA 1
"""
# 3
result_df = pd.merge(left=df, right=match_count_df, how='left')
"""
id A B C Count
0 1 False False NA NaN
1 2 True False NA 1.0
"""
# 4
result_df.loc[:, 'Count'] = result_df['Count'].fillna(0)
"""
id A B C Count
0 1 False False NA 0.0
1 2 True False NA 1.0
"""
You can compare the two dataframes row-wise, with the columns aligned (from df to df2) and NaN values ignored (since they are not comparable):
df.assign(Count=df.set_index('id').apply(lambda x: (x.dropna() == df2[x.index].squeeze().dropna()).all() * 1, axis=1).values)
id A B C Count
0 1 False False NaN 0
1 2 True False NaN 1
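A possibly simpler variant (a sketch, assuming the df and df2 defined above and that df2 has no duplicate A/B/C combinations): a left merge with indicator=True flags which rows of df found a match, and that flag converts directly into the Count column.
# Sketch: the left merge keeps df's row order; '_merge' == 'both' marks matched rows.
matched = df.merge(df2[['A', 'B', 'C']], on=['A', 'B', 'C'], how='left', indicator=True)
df['Count'] = (matched['_merge'] == 'both').astype(int)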

Drop a column in pandas if all values equal 1?

How do I drop columns in pandas where all values in that column are equal to a particular number? For instance, consider this dataframe:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1],
                   'B': [0, 1, 2, 3],
                   'C': [1, 1, 1, 1]})
print(df)
Output:
A B C
0 1 0 1
1 1 1 1
2 1 2 1
3 1 3 1
How would I drop the 1 columns so that the output is:
B
0 0
1 1
2 2
3 3
Use DataFrame.loc with a test for at least one non-1 value per column, via DataFrame.ne with DataFrame.any:
df1 = df.loc[:, df.ne(1).any()]
Or test for 1 with DataFrame.eq and DataFrame.all (all True per column), then invert the mask with ~:
df1 = df.loc[:, ~df.eq(1).all()]
print (df1)
B
0 0
1 1
2 2
3 3
EDIT:
One consideration is what you want to happen if a column contains only NaN and 1.
To drop such a column too, first replace NaN with 0 using DataFrame.fillna and then apply the same solutions as before:
df1 = df.loc[:, df.fillna(0).ne(1).any()]
df1 = df.loc[:, ~df.fillna(0).eq(1).all()]
You can use any:
df.loc[:, df.ne(1).any()]
One consideration is what you want to happen if a column contains only NaN and 1.
If you want to drop it in that case as well, you will need to either fillna with 1 or add a new condition.
import numpy as np

df = pd.DataFrame({'A': [1, 1, 1, 1],
                   'B': [0, 1, 2, 3],
                   'C': [1, 1, 1, np.nan]})
print(df)
A B C
0 1 0 1.0
1 1 1 1.0
2 1 2 1.0
3 1 3 NaN
Both of these leave that column (containing only NaN and 1s) in place:
df.loc[:, df.ne(1).any()]
df.loc[:, ~df.eq(1).all()]
So you can add this condition to drop that column as well:
df.loc[:, ~(df.eq(1) | df.isna()).all()]
Output:
B
0 0
1 1
2 2
3 3
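As a small generalization (a sketch; the helper name drop_constant_columns and the treat_nan_as_match flag are my own, not from the answers above), the same masking idea can be wrapped in a function that drops columns whose values all equal any given number, optionally counting NaN as a match:
import pandas as pd
import numpy as np

def drop_constant_columns(frame, value, treat_nan_as_match=False):
    # Drop columns whose values all equal value; optionally treat NaN as a match.
    eq = frame.eq(value)
    if treat_nan_as_match:
        eq = eq | frame.isna()
    return frame.loc[:, ~eq.all()]

df = pd.DataFrame({'A': [1, 1, 1, 1],
                   'B': [0, 1, 2, 3],
                   'C': [1, 1, 1, np.nan]})
print(drop_constant_columns(df, 1))                           # keeps B and C
print(drop_constant_columns(df, 1, treat_nan_as_match=True))  # keeps only B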

Pandas fill of NA values horizontally, but limited to one forward fill value only

I want to fill values forward horizontally, but limited to one fill value only.
See the frames below: dfa has some gaps that need filling. I want the results as shown in dfb.
(Note the .T at the end of the lines, which transposes the data so it runs horizontally.)
However, dfa.fillna(0, limit=1, axis=1) fills all cells in the Name row, whereas columns 4 and 5 (i.e. the two columns to the left of the 7 in the Name row) should remain NaN.
import pandas as pd
import numpy as np
dfa = pd.DataFrame({'Name': [1, np.nan, 3, np.nan, np.nan, np.nan, 7, np.nan],
                    'Age': [np.nan, 2, np.nan, 4, np.nan, 6, np.nan, 8]}).T
dfb = pd.DataFrame({'Name': [1, 0, 3, 0, np.nan, np.nan, 7, 0],
                    'Age': [np.nan, 2, 0, 4, 0, 6, 0, 8]}).T
dfc = dfa.fillna(0, limit=1, axis=1)
One idea is to use forward filling to build a mask, and then replace values with DataFrame.mask using conditions chained with &:
m = dfa.ffill(limit=1, axis=1).isna()
print (m)
0 1 2 3 4 5 6 7
Name False False False False True True False False
Age True False False False False False False False
dfc = dfa.mask(dfa.isna() & ~m, 0)
Or first replace all NaNs with 0 and then re-create the NaNs from the mask:
dfc = dfa.fillna(0).mask(m)
print (dfc)
0 1 2 3 4 5 6 7
Name 1.0 0.0 3.0 0.0 NaN NaN 7.0 0.0
Age NaN 2.0 0.0 4.0 0.0 6.0 0.0 8.0
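The same idea generalizes if you want to allow up to N consecutive gaps to be filled instead of just one: build the mask with ffill(limit=N). A sketch, assuming the dfa from the question; N is a placeholder value:
# Sketch: cells still NaN after a forward fill limited to N steps stay NaN;
# every other NaN gets filled with 0.
N = 2
m = dfa.ffill(limit=N, axis=1).isna()
dfc = dfa.fillna(0).mask(m)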

Pandas - Conditional drop duplicates based on number of NaN

I have a pandas 0.24.2 DataFrame on Python 3.7.x as below. I want to drop_duplicates() rows with the same Name based on conditional logic. A similar question can be found here: Pandas - Conditional drop duplicates, but my case is more complicated.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6],
    'Name': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Value1': [1, np.NaN, 0, np.NaN, 1, np.NaN],
    'Value2': [np.NaN, 0, np.NaN, 1, np.NaN, 0],
    'Value3': [np.NaN, 0, np.NaN, 1, np.NaN, np.NaN]
})
How is it possible to:
Drop duplicates for records with the same 'Name', keeping the one that has fewer NaNs?
If they have the same number of NaNs, keep the one that does NOT have a NaN in 'Value1'?
The desired output would be:
Id Name Value1 Value2 Value3
2 2 B NaN 0 0
3 3 C 0 NaN NaN
4 4 A NaN 1 1
The idea is to create helper columns for both conditions, sort, and then remove duplicates:
df1 = df.assign(count=df.isna().sum(axis=1),
                count_val1=df['Value1'].isna().view('i1'))
df2 = (df1.sort_values(['count', 'count_val1'])[df.columns]
          .drop_duplicates('Name')
          .sort_index())
print (df2)
Id Name Value1 Value2 Value3
1 2 B NaN 0.0 0.0
2 3 C 0.0 NaN NaN
3 4 A NaN 1.0 1.0
Here is a different solution. The goal is to create two columns that help sort the duplicate rows that will be deleted.
First, we create the columns.
df['count_nan'] = df.isnull().sum(axis=1)

Value1_nan = []
for row in df['Value1']:
    if row >= 0:  # a comparison against NaN is False, so this flags non-NaN values
        Value1_nan.append(0)
    else:
        Value1_nan.append(1)
df['Value1_nan'] = Value1_nan
We then sort the rows so that, within each Name, the row with the fewest NaNs appears first (rows with a non-NaN Value1 also sort ahead of rows where Value1 is NaN, since NaN sorts last).
df.sort_values(by=['Name', 'count_nan', 'Value1'], inplace=True, ascending=[True, True, True])
Finally, we drop all but the "first" duplicate row per Name; that is, we keep the row with the fewest NaNs, preferring a non-NaN Value1 as the tie-break.
df = df.drop_duplicates(subset=['Name'], keep='first')
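A variant of the same idea without adding helper columns to df (a sketch, assuming the df from the question; keys is a name introduced here for illustration): build the two sort criteria as a separate frame, reorder df by them, and keep the first row per Name.
# Sketch: n_nan = NaNs per row, v1_nan = whether Value1 is NaN (tie-break).
keys = pd.DataFrame({'n_nan': df.isna().sum(axis=1),
                     'v1_nan': df['Value1'].isna()})
result = (df.loc[keys.sort_values(['n_nan', 'v1_nan']).index]
            .drop_duplicates('Name')
            .sort_index())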

Manipulate values in pandas DataFrame columns based on matching IDs from another DataFrame

I have two dataframes like the following examples:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
                   'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
                      'c': ['123', '100', '6', np.nan]})
print(df)
a b c
0 20 1.0 NaN
1 50 NaN 1.0
2 100 1.0 1.0
print(df_id)
b c
0 50 123
1 4954 100
2 93920 6
3 20 NaN
For each identifier in df['a'], I want to null the value in df['b'] if there is no matching identifier in any row in df_id['b']. I want to do the same for column df['c'].
My desired result is as follows:
result = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, np.nan],
                       'c': [np.nan, np.nan, 1]})
print(result)
a b c
0 20 1.0 NaN
1 50 NaN NaN # df_id['c'] did not contain '50'
2 100 NaN 1.0 # df_id['b'] did not contain '100'
My attempt to do this is here:
for i, letter in enumerate(['b', 'c']):
    df[letter] = (df.apply(lambda x: x[letter] if x['a']
                  .isin(df_id[letter].tolist()) else np.nan, axis=1))
The error I get:
AttributeError: ("'str' object has no attribute 'isin'", 'occurred at index 0')
This is in Python 3.5.2, pandas version 0.20.1.
You can solve your problem using this instead:
for letter in ['b', 'c']:  # took off enumerate since it isn't needed here; you may need it elsewhere in your code
    df[letter] = df.apply(lambda row: row[letter] if row['a'] in df_id[letter].tolist() else np.nan, axis=1)
Just replace isin with in.
The problem is that when you use apply on df with axis=1, x represents a row, so x['a'] selects a single element.
However, isin is a method of Series and other list-like structures, which is why calling it on a single value raises the error; instead we use in to check whether that element is in the list.
Hope that was helpful. If you have any questions please ask.
Adapting a hard-to-find answer from Pandas New Column Calculation Based on Existing Columns Values:
for i, letter in enumerate(['b', 'c']):
    mask = df['a'].isin(df_id[letter])
    name = letter + '_new'
    # for some reason, df[letter] = df.loc[mask, letter] does not work
    df.loc[mask, name] = df.loc[mask, letter]
    df[letter] = df[name]
    del df[name]
This isn't pretty, but seems to work.
If you have a bigger DataFrame and performance is important, you can first build a mask DataFrame and then apply it to your DataFrame.
First create the mask:
mask = df_id.apply(lambda x: df['a'].isin(x))
b c
0 True False
1 True False
2 False True
This can be applied to the original dataframe:
df.iloc[:,1:] = df.iloc[:,1:].mask(~mask, np.nan)
a b c
0 20 1.0 NaN
1 50 NaN NaN
2 100 NaN 1.0
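For completeness, a compact per-column variant (a sketch, assuming the df and df_id defined in the question): Series.where keeps a value only where the condition holds and inserts NaN elsewhere, which is exactly the desired behaviour.
# Sketch: for each column, keep the value only if the identifier in df['a']
# appears anywhere in the corresponding df_id column.
for col in ['b', 'c']:
    df[col] = df[col].where(df['a'].isin(df_id[col]))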
