How to compare across multiple columns containing list values in Python Pandas?

I have the following sample data
ID VAR1 VAR2 VAR3 DATE
1 NaN [Timestamp('2012-08-03'), 'M'] [Timestamp('2012-08-03'), 'M'] 2012-08-03
2 [Timestamp('2009-04-01'), 'F'] NaN [Timestamp('2009-04-03'), 'F'] 2009-04-01
3 NaN [Timestamp('2004-01-01'), 'M'] NaN 2004-01-01
4 NaN [Timestamp('2004-02-15'), 'M'] [Timestamp('2000-08-07'), 'M'] 2000-08-07
For each row, I want to go through VAR1, VAR2, and VAR3 and compare each against the DATE column. Each of the three columns holds either np.nan (a missing value) or a list (containing a date and a gender). I want to compare the first element of the list against the DATE value. If that date differs from the DATE value by more than a day, I want to replace the cell with np.nan.
I would like to use Pandas' apply function, since I understand its underlying logic.
The desired processed df should be as follows:
ID VAR1 VAR2 VAR3 DATE
1 NaN [Timestamp('2012-08-03'), 'M'] [Timestamp('2012-08-03'), 'M'] 2012-08-03
2 [Timestamp('2009-04-01'), 'F'] NaN NaN 2009-04-01
3 NaN [Timestamp('2004-01-01'), 'M'] NaN 2004-01-01
4 NaN NaN [Timestamp('2000-08-07'), 'M'] 2000-08-07
This is my current code:
from datetime import timedelta
import numpy as np

def remove_value_if_unmatched_against_index_date(row):
    # Blank out any VAR cell whose date is a day or more away from DATE.
    var_cols = ['VAR1', 'VAR2', 'VAR3']
    for var in var_cols:
        if isinstance(row[var], list):  # doesn't work
        # if row[var].notnull():  # doesn't work
        # if row[var] != np.nan:  # doesn't work
            if abs(row[var][0] - row['DATE']) >= timedelta(days=1):
                row[var] = np.nan
    return row

df = df.apply(remove_value_if_unmatched_against_index_date, axis=1)
The problem is that none of the following checks (if isinstance(row[var], list):, if row[var].notnull():, and if row[var] != np.nan:) detects whether the cell holds a list.

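Try with bfill, which fills each NaN from the next non-null value to its right, so VAR1 ends up holding the first non-null list in the row; .str[0] then takes that list's first element: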
df['new'] = df.bfill(axis=1)['VAR1'].str[0]
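If the goal is to blank out the mismatched cells rather than extract a date, one possible sketch is below. It rebuilds the sample frame from the question and assumes the cells hold real Python lists (not their string representations) and that DATE is a datetime64 column. The names blank_if_mismatched and VAR_COLS are made up for illustration, the one-day threshold follows the timedelta(days=1) check in the question's code, and the sketch walks each column with zip instead of a row-wise apply:

```python
from datetime import timedelta

import numpy as np
import pandas as pd

# Sample data from the question (cells are real lists or NaN).
df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'VAR1': [np.nan, [pd.Timestamp('2009-04-01'), 'F'], np.nan, np.nan],
    'VAR2': [[pd.Timestamp('2012-08-03'), 'M'], np.nan,
             [pd.Timestamp('2004-01-01'), 'M'],
             [pd.Timestamp('2004-02-15'), 'M']],
    'VAR3': [[pd.Timestamp('2012-08-03'), 'M'],
             [pd.Timestamp('2009-04-03'), 'F'],
             np.nan, [pd.Timestamp('2000-08-07'), 'M']],
    'DATE': pd.to_datetime(['2012-08-03', '2009-04-01',
                            '2004-01-01', '2000-08-07']),
})

VAR_COLS = ['VAR1', 'VAR2', 'VAR3']  # hypothetical constant


def blank_if_mismatched(cell, date, tolerance=timedelta(days=1)):
    """Return NaN when a list cell's date is `tolerance` or more from `date`."""
    if isinstance(cell, list) and abs(cell[0] - date) >= tolerance:
        return np.nan
    return cell  # NaN cells and matching lists pass through unchanged


for col in VAR_COLS:
    # Pair every cell with the DATE of the same row.
    df[col] = [blank_if_mismatched(cell, date)
               for cell, date in zip(df[col], df['DATE'])]

print(df)
```

After this runs, VAR3 for ID 2 and VAR2 for ID 4 are NaN, which matches the desired output in the question. If the cells were read back from a CSV they would be strings rather than lists, which is one possible reason an isinstance check fails.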

Related

Keeping columns of pandas dataframe whose substring is in the list

I have a dataframe with many columns. I only want to retain the columns whose names contain one of the substrings in a list. For example, the list and dataframe are:
lst = ['col93','col71']
sample_id. col9381.3 col8371.8 col71937.9 col19993.1
1
2
3
4
Based on the substrings, the resulting dataframe will look like:
sample_id. col9381.3 col71937.9
1
2
3
4
I have code that goes through the list and filters the columns whose names contain one of the substrings, but I don't know how to create a single dataframe from the results. The code so far:
for i in lst:
    df2 = df1.filter(regex=i)
    if df2.shape[1] > 0:
        print(df2)
The above code is able to filter the columns, but I don't know how to combine all of these into one dataframe. Any insights would be appreciated.
Try startswith, which accepts a tuple of options:
df1.loc[:, df1.columns.str.startswith(('sample_id.',) + tuple(lst))]
Or filter, which accepts a regex, as you were trying:
df1.filter(regex='|'.join(['sample_id'] + lst))
Output:
sample_id. col9381.3 col71937.9
0 1 NaN NaN
1 2 NaN NaN
2 3 NaN NaN
3 4 NaN NaN
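To answer the literal question of combining the per-substring results, a minimal sketch follows. The numeric values in df1 are placeholders (the question shows none), and the variable names parts, combined, keep, and masked are made up for illustration. It first stitches the loop's results together with pd.concat, then shows a plain membership test on the column names that gives the same columns:

```python
import pandas as pd

lst = ['col93', 'col71']

# Placeholder values; only the column names come from the question.
df1 = pd.DataFrame({
    'sample_id.': [1, 2, 3, 4],
    'col9381.3': [0.1, 0.2, 0.3, 0.4],
    'col8371.8': [1.1, 1.2, 1.3, 1.4],
    'col71937.9': [2.1, 2.2, 2.3, 2.4],
    'col19993.1': [3.1, 3.2, 3.3, 3.4],
})

# Option 1: filter once per substring, then glue the pieces side by side.
# filter(like=...) keeps columns whose name contains the substring.
parts = [df1.filter(like=s) for s in ['sample_id'] + lst]
combined = pd.concat(parts, axis=1)

# Option 2: build the list of wanted columns directly.
keep = [c for c in df1.columns
        if c == 'sample_id.' or any(s in c for s in lst)]
masked = df1[keep]

print(combined.columns.tolist())
print(masked.columns.tolist())
```

With pd.concat, a column that matches two substrings would appear twice, so the second option may be the safer of the two when substrings can overlap.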

Summing up two columns of pandas dataframe ignoring NaN

I have a pandas dataframe as below:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ORDER':["A", "A"], 'col1':[np.nan, np.nan], 'col2':[np.nan, 5]})
df
ORDER col1 col2
0 A NaN NaN
1 A NaN 5.0
I want to create a column 'new' as sum(col1, col2), ignoring NaN when only one of the columns is NaN.
If both columns are NaN, it should return NaN, as shown below.
I tried the code below and it works fine. Is there any way to achieve the same with just one line of code?
df['new'] = df[['col1', 'col2']].sum(axis = 1)
df['new'] = np.where(pd.isnull(df['col1']) & pd.isnull(df['col2']), np.nan, df['new'])
df
ORDER col1 col2 new
0 A NaN NaN NaN
1 A NaN 5.0 5.0
Use sum with min_count=1, so that a row needs at least one non-NaN value to produce a sum:
df['new'] = df[['col1','col2']].sum(axis=1, min_count=1)
Output:
0 NaN
1 5.0
dtype: float64
Use the add function on the two columns, which takes a fill_value argument that lets you replace NaN:
df['col1'].add(df['col2'], fill_value=0)
0 NaN
1 5.0
dtype: float64
Is this OK? (Note that it also turns a genuine sum of 0 into NaN.)
df['new'] = df[['col1', 'col2']].sum(axis = 1).replace(0,np.nan)
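To see where the one-liners above differ, here is a small sketch. The two extra rows (0 plus NaN, and -2 plus 2) are not in the question; they are added here only to exercise the case where the true sum is zero, and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# First two rows mirror the question; the last two are added test cases.
df = pd.DataFrame({'col1': [np.nan, np.nan, 0.0, -2.0],
                   'col2': [np.nan, 5.0, np.nan, 2.0]})

# NaN only when every value in the row is NaN.
with_min_count = df[['col1', 'col2']].sum(axis=1, min_count=1)

# NaN whenever the sum happens to be 0, including genuine zeros.
with_replace = df[['col1', 'col2']].sum(axis=1).replace(0, np.nan)

print(pd.DataFrame({'min_count': with_min_count, 'replace': with_replace}))
```

Only the min_count version keeps the zero sums in the last two rows, so it seems the closer match to "NaN only if both columns are NaN".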

Pandas select preferred value from one of two columns to make a new column

I have a Pandas DataFrame with two columns of "complementary" data. For any given row, there are 3 possibilities:
1) Column A has a non-null value, and column B has a null value, NaN, that I want to replace with the non-null value from column A.
2) Column A has a null value, NaN, that I want to replace with the non-null value from column B.
3) Both columns A and B have null values, NaN, which means I'll keep NaN as the value for that row.
Here's a simplified version of my DataFrame:
df1 = pd.DataFrame({'A' : ['keep1', np.nan, np.nan, 'keep4', np.nan],
'B' : [np.nan, 'keep2', np.nan, np.nan, np.nan]})
I was thinking that as an intermediate step, I'd create a new column C with the entries I need:
df2 = pd.DataFrame({'A' : ['keep1', np.nan, np.nan, 'keep4', np.nan],
                    'B' : [np.nan, 'keep2', np.nan, np.nan, np.nan],
                    'C' : ['keep1', 'keep2', np.nan, 'keep4', np.nan]})
Then I'd drop the first two columns, A and B:
df_final = df2.drop(['A', 'B'], axis=1)
My actual DataFrame has hundreds of rows, and I've tried several approaches (boolean filters, looping through the DataFrame using iterrows, using DataFrame.where()) without success. I'd think this would be a simple problem, but I'm not seeing it. Any help is appreciated.
Thanks
You can use combine_first() to fill the gaps in A from B:
df1['C'] = df1['A'].combine_first(df1['B'])
#0 keep1
#1 keep2
#2 NaN
#3 keep4
#4 NaN
Use Series.fillna to replace missing values in A with the values from B:
df1['C'] = df1.A.fillna(df1.B)
print (df1)
A B C
0 keep1 NaN keep1
1 NaN keep2 keep2
2 NaN NaN NaN
3 keep4 NaN keep4
4 NaN NaN NaN
To avoid drop, you can use DataFrame.pop to extract the columns:
df1['C'] = df1.pop('A').fillna(df1.pop('B'))
print (df1)
C
0 keep1
1 keep2
2 NaN
3 keep4
4 NaN
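If there were more than two complementary columns (an assumption that goes beyond the question), one way to generalize the idea is to back-fill across the row and keep the first column. The sketch below uses only A and B from the question, and computes the combine_first result alongside for comparison; the names first_non_null and check are illustrative:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['keep1', np.nan, np.nan, 'keep4', np.nan],
                    'B': [np.nan, 'keep2', np.nan, np.nan, np.nan]})

# Back-fill along each row, so the leftmost column ends up holding the
# first non-null value found in that row (or NaN if there is none).
first_non_null = df1[['A', 'B']].bfill(axis=1).iloc[:, 0]

# The two-column answer from above, for comparison.
check = df1['A'].combine_first(df1['B'])

df1['C'] = first_non_null
df_final = df1[['C']]
print(df_final)
```

The list passed to df1[[...]] sets the priority order, so putting 'B' first would prefer B's value whenever both columns are filled.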

How to sum columns in python based on column with not empty string

df = pd.DataFrame({
'key1':[np.nan,'a','b','b','a'],
'data1':[2,5,8,5,7],
'key2':['ab', 'aa', np.nan, np.nan, 'one'],
'data2':[1,5,9,6,3],
'Sum over columns':[1,10,8,5,10]})
Hi everybody, could you please help me with the following issue:
I'm trying to sum over columns to get a sum of data1 and data2.
A data column should only count toward the sum when its key column is not NaN: data1 counts if key1 is not NaN, and data2 counts if key2 is not NaN. The result I want is shown in the 'Sum over columns' column. Thank you for your help!
Try using the .apply method of df on axis=1 and numpy's array multiplication function to get your desired output. Select the columns by label rather than by position, and convert the null mask to a NumPy array so that pandas does not try to align the differently named columns:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'key1':[np.nan,'a','b','b','a'],
    'data1':[2,5,8,5,7],
    'key2':['ab', 'aa', np.nan, np.nan, 'one'],
    'data2':[1,5,9,6,3]})
# Multiply each data value by whether its key is present (True -> 1, False -> 0), then sum per row.
df['Sum over columns'] = df.apply(lambda x: np.multiply(x[['data1','data2']].to_numpy(), x[['key1','key2']].notna().to_numpy()).sum(), axis=1)
Or:
df['Sum over columns'] = np.multiply(df[['data1','data2']], df[['key1','key2']].notna().to_numpy()).sum(axis=1)
Either one of them should yield:
#   key1  data1 key2  data2  Sum over columns
# 0  NaN      2   ab      1                 1
# 1    a      5   aa      5                10
# 2    b      8  NaN      9                 8
# 3    b      5  NaN      6                 5
# 4    a      7  one      3                10
I hope this helps.
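Another possible sketch, which avoids both apply and array multiplication, is to zero out each data column wherever its key is missing and then add the two. It assumes, as the desired output suggests, that data1 is paired with key1 and data2 with key2:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key1': [np.nan, 'a', 'b', 'b', 'a'],
    'data1': [2, 5, 8, 5, 7],
    'key2': ['ab', 'aa', np.nan, np.nan, 'one'],
    'data2': [1, 5, 9, 6, 3]})

# where() keeps the value when the condition is True and uses 0 otherwise.
df['Sum over columns'] = (
    df['data1'].where(df['key1'].notna(), 0)
    + df['data2'].where(df['key2'].notna(), 0)
)
print(df)
```

Because each column is referenced by name, this version does not depend on the order of the columns in the frame.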

How to combine different columns in a dataframe using comprehension-python

Suppose a dataframe contains
attacker_1 attacker_2 attacker_3 attacker_4
Lannister nan nan nan
nan Stark greyjoy nan
I want to create another column called AttackerCombo that aggregates the 4 columns into 1 column.
How would I go about writing this in Python?
I have been practicing Python and I reckon a list comprehension makes sense. However, [list(x) for x in attackers], where attackers is a numpy array of the 4 columns, aggregates all 4 columns into 1 column but keeps the NaNs, and I would like to remove them as well.
So the result for each row, instead of looking like starknannanlannister, would look like stark/lannister.
I think you need apply with join, removing NaN with dropna:
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']] \
.apply(lambda x: '/'.join(x.dropna()), axis=1)
print (df)
attacker_1 attacker_2 attacker_3 attacker_4 attackers
0 Lannister NaN NaN NaN Lannister
1 NaN Stark greyjoy NaN Stark/greyjoy
If you need an empty string as the separator, use DataFrame.fillna:
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']].fillna('') \
.apply(''.join, axis=1)
print (df)
attacker_1 attacker_2 attacker_3 attacker_4 attackers
0 Lannister NaN NaN NaN Lannister
1 NaN Stark greyjoy NaN Starkgreyjoy
Two more solutions use a list comprehension: the first filters with notnull, and the second checks whether each element is a string:
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']] \
.apply(lambda x: '/'.join([e for e in x if pd.notnull(e)]), axis=1)
print (df)
attacker_1 attacker_2 attacker_3 attacker_4 attackers
0 Lannister NaN NaN NaN Lannister
1 NaN Stark greyjoy NaN Stark/greyjoy
#python 3 - isinstance(e, str), python 2 - isinstance(e, basestring)
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']] \
.apply(lambda x: '/'.join([e for e in x if isinstance(e, str)]), axis=1)
print (df)
attacker_1 attacker_2 attacker_3 attacker_4 attackers
0 Lannister NaN NaN NaN Lannister
1 NaN Stark greyjoy NaN Stark/greyjoy
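Since the question mentions a list comprehension over a numpy array of the four columns, here is a minimal sketch along those lines without apply. The frame is rebuilt from the two sample rows, the column name AttackerCombo comes from the question, and cols is an illustrative name:

```python
import numpy as np
import pandas as pd

cols = ['attacker_1', 'attacker_2', 'attacker_3', 'attacker_4']
df = pd.DataFrame(
    [['Lannister', np.nan, np.nan, np.nan],
     [np.nan, 'Stark', 'greyjoy', np.nan]],
    columns=cols)

# Each row of the numpy array holds one record; NaN entries are floats,
# so keeping only str values drops them before joining.
df['AttackerCombo'] = ['/'.join(v for v in row if isinstance(v, str))
                       for row in df[cols].to_numpy()]
print(df['AttackerCombo'].tolist())
```

A row with no attackers at all would come out as an empty string rather than NaN, which may or may not be what is wanted.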
You can add a new column to the dataframe and fill it with a lambda function. Note that this version does not drop missing values; each NaN is rendered as the string 'nan':
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']].apply(lambda x: '{}{}{}{}'.format(*x), axis=1)
You don't specify how you want to aggregate them, so, for instance, if you want them separated by a dash:
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']].apply(lambda x: '{}-{}-{}-{}'.format(*x), axis=1)
