How to compare across multiple columns containing list value in Python Pandas?

I have the following sample data
ID VAR1 VAR2 VAR3 DATE
1 NaN [Timestamp('2012-08-03'), 'M'] [Timestamp('2012-08-03'), 'M'] 2012-08-03
2 [Timestamp('2009-04-01'), 'F'] NaN [Timestamp('2009-04-03'), 'F'] 2009-04-01
3 NaN [Timestamp('2004-01-01'), 'M'] NaN 2004-01-01
4 NaN [Timestamp('2004-02-15'), 'M'] [Timestamp('2000-08-07'), 'M'] 2000-08-07
For each row, I want to go through VAR1, VAR2, and VAR3 and compare each against DATE. Each of the three columns holds either np.nan (a missing value) or a list (containing a date and a gender). I want to compare the first element of the list against the DATE column. If that date differs from the DATE value by a day or more, I want to replace the cell value with np.nan.
I would like to use pandas' apply function, since I understand its underlying logic.
The desired processed df is as follows:
ID VAR1 VAR2 VAR3 DATE
1 NaN [Timestamp('2012-08-03'), 'M'] [Timestamp('2012-08-03'), 'M'] 2012-08-03
2 [Timestamp('2009-04-01'), 'F'] NaN NaN 2009-04-01
3 NaN [Timestamp('2004-01-01'), 'M'] NaN 2004-01-01
4 NaN NaN [Timestamp('2000-08-07'), 'M'] 2000-08-07
This is my work-in-progress code:
import numpy as np
from datetime import timedelta

def remove_value_if_unmatched_against_index_date(df):
    # With axis=1, apply passes one row at a time, so df here is a single row (a Series)
    vars = ['VAR1', 'VAR2', 'VAR3']
    for var in vars:
        if isinstance(df[var], list):  # doesn't work
        # if df[var].notnull():  # doesn't work
        # if df[var] != np.nan:  # doesn't work
            if abs(df[var][0] - df['DATE']) >= timedelta(days=1):
                df[var] = np.nan
    return df

df = df.apply(remove_value_if_unmatched_against_index_date, axis=1)
The problem is that none of the following checks (if isinstance(df[var], list):, if df[var].notnull():, and if df[var] != np.nan:) works for testing whether the cell contains a list value.

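Try with bfill along the columns, which pulls the first non-null value to the right into VAR1 for each row; .str[0] then takes the first element (the date) of that list: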
df['new'] = df.bfill(axis=1)['VAR1'].str[0]
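
The question asks for unmatched cells to be blanked in place rather than for a new column, so here is a minimal sketch of a row-wise function for that. It assumes the cells hold real Python lists whose first element is a Timestamp and that DATE is already a datetime column; the function name blank_unmatched is made up for illustration. If the lists were read from a CSV file they would be strings, and the isinstance check would never be true.

```python
import numpy as np
import pandas as pd

ts = pd.Timestamp
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "VAR1": [np.nan, [ts("2009-04-01"), "F"], np.nan, np.nan],
    "VAR2": [[ts("2012-08-03"), "M"], np.nan,
             [ts("2004-01-01"), "M"], [ts("2004-02-15"), "M"]],
    "VAR3": [[ts("2012-08-03"), "M"], [ts("2009-04-03"), "F"],
             np.nan, [ts("2000-08-07"), "M"]],
    "DATE": pd.to_datetime(["2012-08-03", "2009-04-01",
                            "2004-01-01", "2000-08-07"]),
})

def blank_unmatched(row, cols=("VAR1", "VAR2", "VAR3")):
    """Replace a list cell with NaN when its date is a day or more away from DATE."""
    row = row.copy()  # avoid mutating the object pandas hands to us
    for col in cols:
        value = row[col]
        # Only list cells carry a date; NaN cells are floats and are skipped.
        if isinstance(value, list) and abs(value[0] - row["DATE"]) >= pd.Timedelta(days=1):
            row[col] = np.nan
    return row

out = df.apply(blank_unmatched, axis=1)
print(out)
```

On the sample data this should blank VAR3 in the row with ID 2 and VAR2 in the row with ID 4, as in the desired output.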

Related

Keeping columns of pandas dataframe whose substring is in the list

I have a dataframe with many columns. I only want to retain those columns whose substring is in the list. For example the lst and dataframe is:
lst = ['col93','col71']
sample_id. col9381.3 col8371.8 col71937.9 col19993.1
1
2
3
4
Based on the substrings, the resulting dataframe will look like:
sample_id. col9381.3 col71937.9
1
2
3
4
I have code that goes through the list and filters out the columns that contain one of the substrings, but I don't know how to build a single dataframe from the results. The code so far:
for i in lst:
    df2 = df1.filter(regex=i)
    if df2.shape[1] > 0:
        print(df2)
The above code is able to filter out the columns, but I don't know how to combine all of them into one dataframe. Insights will be appreciated.

Try with startswith which accepts a tuple of options:
df.loc[:, df.columns.str.startswith(('sample_id.',)+tuple(lst))]
Or filter which accepts a regex as you were trying:
df.filter(regex='|'.join(['sample_id']+lst))
Output:
sample_id. col9381.3 col71937.9
0 1 NaN NaN
1 2 NaN NaN
2 3 NaN NaN
3 4 NaN NaN
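
Note that startswith only matches prefixes, while the question talks about substrings. A minimal sketch of a plain substring test is shown below, next to the regex variant with the substrings escaped so that characters such as '.' are matched literally. The numeric cell values are placeholders, since the question does not show any.

```python
import re
import pandas as pd

# Placeholder values; only the column names come from the question.
df1 = pd.DataFrame({
    "sample_id.": [1, 2, 3, 4],
    "col9381.3": [0.1, 0.2, 0.3, 0.4],
    "col8371.8": [1.1, 1.2, 1.3, 1.4],
    "col71937.9": [2.1, 2.2, 2.3, 2.4],
    "col19993.1": [3.1, 3.2, 3.3, 3.4],
})
lst = ["col93", "col71"]

# Keep the id column plus every column whose name contains any substring.
keep = ["sample_id."] + [c for c in df1.columns if any(s in c for s in lst)]
result = df1[keep]

# Equivalent regex form; re.escape guards against regex metacharacters.
pattern = "|".join(re.escape(s) for s in ["sample_id"] + lst)
same = df1.filter(regex=pattern)

print(result)
```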

Summing up two columns of pandas dataframe ignoring NaN

I have a pandas dataframe as below:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ORDER':["A", "A"], 'col1':[np.nan, np.nan], 'col2':[np.nan, 5]})
df
ORDER col1 col2
0 A NaN NaN
1 A NaN 5.0
I want to create a column 'new' as sum(col1, col2), ignoring NaN when only one of the columns is NaN.
If both columns are NaN, it should return NaN, as shown below.
I tried the code below and it works fine. Is there any way to achieve the same with just one line of code?
df['new'] = df[['col1', 'col2']].sum(axis = 1)
df['new'] = np.where(pd.isnull(df['col1']) & pd.isnull(df['col2']), np.nan, df['new'])
df
ORDER col1 col2 new
0 A NaN NaN NaN
1 A NaN 5.0 5.0

Use sum with min_count=1, which returns NaN for a row unless it contains at least one non-NaN value:
df['new'] = df[['col1','col2']].sum(axis=1,min_count=1)
Out[78]:
0 NaN
1 5.0
dtype: float64

Use the add function on the two columns, which takes a fill_value argument that lets you replace NaN:
df['col1'].add(df['col2'], fill_value=0)
0 NaN
1 5.0
dtype: float64

Is this ok?
df['new'] = df[['col1', 'col2']].sum(axis = 1).replace(0,np.nan)
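
A caveat on the replace(0, np.nan) variant: it also wipes out rows whose real sum is zero. The sketch below compares the three suggestions on a small frame that was extended with two such rows (2 + -2, and 0 with NaN); the extra rows are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [np.nan, np.nan, 2.0, 0.0],
    "col2": [np.nan, 5.0, -2.0, np.nan],
})

# NaN only when every value in the row is missing.
via_min_count = df[["col1", "col2"]].sum(axis=1, min_count=1)

# fill_value=0 fills a NaN only when the other operand is present.
via_add = df["col1"].add(df["col2"], fill_value=0)

# Turns every zero sum into NaN, including genuine zeros.
via_replace = df[["col1", "col2"]].sum(axis=1).replace(0, np.nan)

print(pd.DataFrame({
    "min_count": via_min_count,
    "add": via_add,
    "replace": via_replace,
}))
```

The first two agree on all four rows, while the replace variant reports NaN for the two rows whose sum is truly 0.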

Pandas select preferred value from one of two columns to make a new column

I have a Pandas DataFrame with two columns of "complementary" data. For any given row, there are 3 possibilities:
1) Column A has a non-null value, and column B has a null value, NaN, that I want to replace with the non-null value from column A.
2) Column A has a null value, NaN, that I want to replace with the non-null value from column B.
3) Both columns A and B have null values, NaN, which means I'll keep NaN as the value for that row.
Here's a simplified version of my DataFrame:
df1 = pd.DataFrame({'A' : ['keep1', np.nan, np.nan, 'keep4', np.nan],
'B' : [np.nan, 'keep2', np.nan, np.nan, np.nan]})
I was thinking that as an intermediate step, I'd create a new column C with the entries I need:
df2 = pd.DataFrame({'A' : ['keep1', np.nan, np.nan, 'keep4', np.nan],
'B' : [np.nan, 'keep2', np.nan, np.nan, np.nan],
'C' : ['keep1', 'keep2', np.nan, 'keep4', np.nan]})
Then I'd drop the first two columns, A and B:
df_final = df2.drop(['A', 'B'], axis=1)
My actual DataFrame has hundreds of rows, and I've tried several approaches (boolean filters, looping through the DataFrame using iterrows, using DataFrame.where()) without success. I'd think this would be a simple problem, but I'm not seeing it. Any help is appreciated.
Thanks

You can use combine_first() to fill the gaps in A from B:
df1['C'] = df1['A'].combine_first(df1['B'])
#0 keep1
#1 keep2
#2 NaN
#3 keep4
#4 NaN

Use Series.fillna to replace the missing values in A with the values from B:
df1['C'] = df1.A.fillna(df1.B)
print (df1)
A B C
0 keep1 NaN keep1
1 NaN keep2 keep2
2 NaN NaN NaN
3 keep4 NaN keep4
4 NaN NaN NaN
To avoid the separate drop step, you can use DataFrame.pop, which extracts each column and removes it from the dataframe:
df1['C'] = df1.pop('A').fillna(df1.pop('B'))
print (df1)
C
0 keep1
1 keep2
2 NaN
3 keep4
4 NaN
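
If there were more than two complementary columns, one possible generalization is to back-fill along the columns and keep the first one. This is only a sketch: the third column 'D' is hypothetical and not part of the question.

```python
import numpy as np
import pandas as pd

# 'D' is a made-up extra column to show the idea with three sources.
df = pd.DataFrame({
    "A": ["keep1", np.nan, np.nan, np.nan],
    "B": [np.nan, "keep2", np.nan, np.nan],
    "D": [np.nan, np.nan, "keep3", np.nan],
})

# bfill(axis=1) moves the first non-null value of each row leftwards,
# so the first column ends up holding the preferred value.
result = df.bfill(axis=1).iloc[:, 0]
print(result)
```

Column order sets the priority here: a value in A wins over B, and B wins over D.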

How to sum columns in python based on column with not empty string

df = pd.DataFrame({
'key1':[np.nan,'a','b','b','a'],
'data1':[2,5,8,5,7],
'key2':['ab', 'aa', np.nan, np.nan, 'one'],
'data2':[1,5,9,6,3],
'Sum over columns':[1,10,8,5,10]})
Hi everybody, could you please help me with following issue:
I'm trying to sum over columns to get a sum of data1 and data2.
If the string column key1 is not NaN, data1 should be counted, and if the string column key2 is not NaN, data2 should be counted. The result I want is shown in the 'Sum over columns' column. Thank you for your help!

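Try using the .apply method of df on axis=1 together with NumPy's element-wise multiplication to get your desired output. Select the columns by label and convert them to arrays with to_numpy(), so that the result depends neither on the column order nor on pandas trying to align the differently named data and key columns:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'key1':[np.nan,'a','b','b','a'],
'data1':[2,5,8,5,7],
'key2':['ab', 'aa', np.nan, np.nan, 'one'],
'data2':[1,5,9,6,3]})
df['Sum over columns'] = df.apply(lambda x: np.multiply(x[['data1', 'data2']].to_numpy(), x[['key1', 'key2']].notna().to_numpy()).sum(), axis=1)
Or, without apply:
df['Sum over columns'] = np.multiply(df[['data1', 'data2']].to_numpy(), df[['key1', 'key2']].notna().to_numpy()).sum(axis=1)
Either one of them should yield the following (the column order in the printout may differ with your pandas version):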
# data1 data2 key1 key2 Sum over columns
# 0 2 1 NaN ab 1
# 1 5 5 a aa 10
# 2 8 9 b NaN 8
# 3 5 6 b NaN 5
# 4 7 3 a one 10
I hope this helps.
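
A possibly more readable variant, offered as a sketch rather than as part of the answer above: DataFrame.where keeps a data value only where the matching key is present and substitutes 0 elsewhere. It assumes that data1 pairs with key1 and data2 with key2, in that order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key1": [np.nan, "a", "b", "b", "a"],
    "data1": [2, 5, 8, 5, 7],
    "key2": ["ab", "aa", np.nan, np.nan, "one"],
    "data2": [1, 5, 9, 6, 3],
})

# Boolean array: True where the key for that data column is present.
# to_numpy() avoids label alignment between the key and data columns.
mask = df[["key1", "key2"]].notna().to_numpy()

# Keep data values where the key exists, use 0 otherwise, then sum per row.
df["Sum over columns"] = df[["data1", "data2"]].where(mask, 0).sum(axis=1)
print(df)
```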

How to combine different columns in a dataframe using comprehension-python

Suppose a dataframe contains
attacker_1 attacker_2 attacker_3 attacker_4
Lannister nan nan nan
nan Stark greyjoy nan
I want to create another column called AttackerCombo that aggregates the 4 columns into 1 column.
How would I go about defining such code in python?
I have been practicing Python and I reckon a list comprehension of this sort makes sense: [list(x) for x in attackers], where attackers is a NumPy array of the 4 columns. It does aggregate all 4 columns into 1 column, but I would like to remove all the NaNs as well.
So the result for each row, instead of looking like
starknannanlannister, would look like stark/lannister
I think you need apply with join, removing the NaN values with dropna:
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']] \
.apply(lambda x: '/'.join(x.dropna()), axis=1)
print (df)
attacker_1 attacker_2 attacker_3 attacker_4 attackers
0 Lannister NaN NaN NaN Lannister
1 NaN Stark greyjoy NaN Stark/greyjoy
If you need an empty string as the separator, use DataFrame.fillna:
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']].fillna('') \
.apply(''.join, axis=1)
print (df)
attacker_1 attacker_2 attacker_3 attacker_4 attackers
0 Lannister NaN NaN NaN Lannister
1 NaN Stark greyjoy NaN Starkgreyjoy
Two more solutions use a list comprehension: the first filters with pd.notnull, the second checks whether each element is a string:
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']] \
.apply(lambda x: '/'.join([e for e in x if pd.notnull(e)]), axis=1)
print (df)
attacker_1 attacker_2 attacker_3 attacker_4 attackers
0 Lannister NaN NaN NaN Lannister
1 NaN Stark greyjoy NaN Stark/greyjoy
#python 3 - isinstance(e, str), python 2 - isinstance(e, basestring)
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']] \
.apply(lambda x: '/'.join([e for e in x if isinstance(e, str)]), axis=1)
print (df)
attacker_1 attacker_2 attacker_3 attacker_4 attackers
0 Lannister NaN NaN NaN Lannister
1 NaN Stark greyjoy NaN Stark/greyjoy
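
Since the question asks about a comprehension over a NumPy array of the four columns, here is a minimal sketch that skips apply and builds the column with a plain list comprehension. The column name AttackerCombo comes from the question; treating every non-string cell as missing is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "attacker_1": ["Lannister", np.nan],
    "attacker_2": [np.nan, "Stark"],
    "attacker_3": [np.nan, "greyjoy"],
    "attacker_4": [np.nan, np.nan],
})
cols = ["attacker_1", "attacker_2", "attacker_3", "attacker_4"]

# Iterate over the rows of the underlying array and join only the strings,
# which drops NaN (a float) from each row.
df["AttackerCombo"] = [
    "/".join(v for v in row if isinstance(v, str))
    for row in df[cols].to_numpy()
]
print(df["AttackerCombo"])
```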

You can set a new column in the dataframe and fill it with a lambda function. Note that this approach renders missing values as the text 'nan', so it has to be combined with dropna or fillna('') as in the answer above if you want them removed:
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']].apply(lambda x : '{}{}{}{}'.format(x.iloc[0],x.iloc[1],x.iloc[2],x.iloc[3]), axis=1)
You don't specify how you want to aggregate them, so for instance, if you want them separated by a dash:
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']].apply(lambda x : '{}-{}-{}-{}'.format(x.iloc[0],x.iloc[1],x.iloc[2],x.iloc[3]), axis=1)
