Change/swap values one after another in pandas dataframe for selected rows - python-3.x

Dataframe:
col1 col2
A 0
A 1
A nan
B 0
B 1
C and so on...
I am trying to change 1 to 0, 0 to 1 and nan stays as such in col2 wherever col1=='A'.
Code so far:
df.loc[(df.col1=='A') & (df.col2==0),'col2'] = 2
df.loc[(df.col1=='A') & (df.col2==1),'col2'] = 0
df.loc[(df.col1=='A') & (df.col2==2),'col2'] = 1
# Hope you understand why I am converting 0 to 2 first then to 1.
# Because if I convert all zeroes to 1 then all 1's will be converted to
# 0 in subsequent conversion.
Unique values in col2 are 0,1 and nan.
Is there a correct/better way of doing this?
Also, is there a way to directly swap these numbers instead of assignment operators?

One solution using Series.where and astype(bool) with ~ (NOT operator) and then back to astype(int). Then use loc with boolean indexing to assign back to DataFrame:
df.loc[df.col1.eq('A'), 'col2'] = df.col2.where(df.col2.isna(),
(~df.col2.astype(bool)).astype(int))
[out]
col1 col2
0 A 1.0
1 A 0.0
2 A NaN
3 B 0.0
4 B 1.0
5 C NaN

You can also try with df.mask():
m=df.col1.eq('A')&df.col2.isna() #condition
df.col2=1-df.col2.mask(m)
print(df)
col1 col2
0 A 1.0
1 A 0.0
2 A NaN
3 B 1.0
4 B 0.0

I am trying to change 1 to 0, 0 to 1 and nan stays as such in col2
wherever col1=='A'.
use np.where
df['col2] = np.where(df['col1'] == 'A', np.where(df['col2'] == 1, 0 , np.where(df['col2'].isnull() == True, df['col2'],1)),df['col2'])
Output
col1 col2
0 A 1.0
1 A 0.0
2 A NaN
3 B 0.0
4 B 1.0
5 C 0.0

In this case, you can also use your own function in combination with apply().
# import pandas
import pandas as pd
# make a sample data
list_of_rows = [
{'col1': A, 'col2': 1},
{'col1': A, 'col2': 0},
{'col1': A, 'col2': None},
{'col1': B, 'col2': 0},
{'col1': B, 'col2': 1},
{'col1': B, 'col2': None},
]
# make a pandas data frame
df = pd.DataFrame(list_of_rows)
# define a function
def change_values(row):
if row['col2'] == 0:
return 1
if row['col2'] == 1:
return 0
return row['col2']
# apply function to dataframe
df['col2'] = df.apply(lambda row: change_values(row), axis=1)

Related

pd dataframe from lists and dictionary using series

I have few lists and a dictionary and would like to create a pd dataframe.
Could someone help me out, I seem to be missing something:
one simple example bellow:
dict={"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
Using series I would do like this:
df = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
and would have the lists within the df as expected
for dict would do
df = pd.DataFrame(list(dic.items()), columns=['col3', 'col4'])
And would expect this result:
col1 col2 col3 col4
1 x a 1
2 y b 3
3 c text1
4
The problem is like this the first df would be overwritten by the second call of pd.Dataframe
How would I do this to have only one df with 4 columns?
I know one way would be to split the dict in 2 separate lists and just use Series over 4 lists, but I would think there is a better way to do this, out of 2 lists and 1 dict as above to have directly one df with 4 columns.
thanks for the help
you can also use pd.concat to concat two dataframe.
df1 = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
df2 = pd.DataFrame(list(dic.items()), columns=['col3', 'col4'])
df = pd.concat([df1, df2], axis=1)
Why not build each column seperately via dict.keys() and dict.values() instead of using dict.items()
df = pd.DataFrame({
'col1': pd.Series(l1),
'col2': pd.Series(l3),
'col3': pd.Series(dict.keys()),
'col4': pd.Series(dict.values())
})
print(df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
Alternatively:
column_values = [l1, l3, dict.keys(), dict.values()]
data = {f"col{i}": pd.Series(values) for i, values in enumerate(column_values)}
df = pd.DataFrame(data)
print(df)
col0 col1 col2 col3
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
You can unpack zipped values of list generated from d.items() and pass to itertools.zip_longest for add missing values for match by maximum length of list:
#dict is python code word, so used d for variable
d={"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
df = pd.DataFrame(zip_longest(l1, l3, *zip(*d.items()),
fillvalue=np.nan),
columns=['col1','col2','col3','col4'])
print (df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN

fill values after condition with NaN

I have a df like this:
df = pd.DataFrame(
[
['A', 1],
['A', 1],
['A', 1],
['B', 2],
['B', 0],
['A', 0],
['A', 1],
['B', 1],
['B', 0]
], columns = ['key', 'val'])
df
print:
key val
0 A 1
1 A 1
2 A 1
3 B 2
4 B 0
5 A 0
6 A 1
7 B 1
8 B 0
I want to fill the rows after 2 in the val column (in the example all values in the val column from row 3 to 8 are replaced with nan).
I tried this:
df['val'] = np.where(df['val'].shift(-1) == 2, np.nan, df['val'])
and iterating over rows like this:
for row in df.iterrows():
df['val'] = np.where(df['val'].shift(-1) == 2, np.nan, df['val'])
but cant get it to fill nan forward.
You can use boolean indexing with cummax to fill nan values:
df.loc[df['val'].eq(2).cummax(), 'val'] = np.nan
Alternatively you can also use Series.mask:
df['val'] = df['val'].mask(lambda x: x.eq(2).cummax())
key val
0 A 1.0
1 A 1.0
2 A 1.0
3 B NaN
4 B NaN
5 A NaN
6 A NaN
7 B NaN
8 B NaN
You can try :
ind = df.loc[df['val']==2].index
df.iloc[ind[0]:,1] = np.nan
Once you get index by df.index[df.val.shift(-1).eq(2)].item() then you can use slicing
idx = df.index[df.val.shift(-1).eq(2)].item()
df.iloc[idx:, 1] = np.nan
df
key val
0 A 1.0
1 A 1.0
2 A NaN
3 B NaN
4 B NaN
5 A NaN
6 A NaN
7 B NaN
8 B NaN

Can I use lambda inside df.apply() to insert 1s into dataframe where combination of the index and column names are in another column of the dataframe?

I have this dataframe:
In [6]: import pandas as pd
In [7]: import numpy as np
In [8]: df = pd.DataFrame(data = np.nan,
...: columns = ['A', 'B', 'C', 'D', 'E'],
...: index = ['A', 'B', 'C', 'D', 'E'])
...:
...: df['list_of_codes'] = [['A' , 'B'],
...: ['A', 'B', 'E'],
...: ['C', 'D'],
...: ['B', 'D'],
...: ['E']]
...:
...: df
Out[8]:
A B C D E list_of_codes
A NaN NaN NaN NaN NaN [A, B]
B NaN NaN NaN NaN NaN [A, B, E]
C NaN NaN NaN NaN NaN [C, D]
D NaN NaN NaN NaN NaN [B, D]
E NaN NaN NaN NaN NaN [E]
And now I want to insert a '1' where both the index and column name are present inside of the list in the column df['list_of_codes']. The result would look like this:
A B C D E list_of_codes
A 1 1 0 0 0 [A, B]
B 1 1 0 0 1 [A, B, E]
C 0 0 1 1 0 [C, D]
D 0 1 0 1 0 [B, D]
E 0 0 0 0 1 [E]
I have tried something like this:
df.apply(lambda x: 1 if x[:-1] in (x[-1]) else 0, axis=1, result_type='broadcast')
but get the error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I don't think I understand this error exactly but then I try:
df.apply(lambda x: 1 if x[:-1].any() in (x[-1]) else 0, axis=1, result_type='broadcast')
This runs but does not give me the desired result. Instead it returns:
A B C D E list_of_codes
A 0 0 0 0 0 0
B 0 0 0 0 0 0
C 0 0 0 0 0 0
D 0 0 0 0 0 0
E 0 0 0 0 0 0
Can someone help me understand what I need in my pd.apply() and lambda functions in order to broadcast the '1's in the way that I am trying to? Thanks in advance!
IIUC, Series.explode and then Series.str.get_dummies to check . Finally, we can use groupby.max to assign to the original dataframe
df = df.assign(**df['list_of_codes'].explode()
.str.get_dummies()
.groupby(level=0).max())
print(df)
Output
A B C D E list_of_codes
A 1 1 0 0 0 [A, B]
B 1 1 0 0 1 [A, B, E]
C 0 0 1 1 0 [C, D]
D 0 1 0 1 0 [B, D]
E 0 0 0 0 1 [E]
Alternative without explode
df = df.assign(**pd.DataFrame(df['list_of_codes'].tolist(),
index = df.index).stack()
.str.get_dummies()
.groupby(level=0)
.max())
EDIT
I think explode is somewhat faster, since in the alternative I propose at the end we are creating a dataframe and then using stack. We can rely on this post : SO explode to use explode. On the other hand we can use the level accessor instead of groupby. Well try to explode by another method of publication and find the method that provides better performance.
index = df.index
df[index] = pd.get_dummies(pd.Series(data = np.concatenate(s.values),
index = index.repeat(s.str.len()))).sum(level=0)
Another approach with pd.Index.isin:
index=df.index
df[index] = [index.isin(l).astype(int) for l in df['list_of_codes']]
I think it could be the fastest
We could also consider writing only true or false. It would be faster.
index=df.index
df[index] = [index.isin(l) for l in df['list_of_codes']]
I can not make a comment "less than 50 reputation", but I do tested ansev's solution with a 15000*15000 size df here is the way I build a test df:
import numpy as np
import pandas as pd
nelem = 15000
elements = range(nelem)
x=np.random.randint(low=1, high=len(elements), size=nelem)
list_of_codes=[]
for i in range(nelem):
list_of_codes.append(np.random.choice(elements,size=x[i]))
df = pd.DataFrame(data = {"list_of_codes":list_of_codes})
for x in elements:
df[x]=np.nan
I tested it on the colab it gave me this outcome:
%timeit df[index] = [index.isin(l) for l in df['list_of_codes']]
The slowest run took 26.21 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 3.04 s per loop
So ansev's solution does work in your case.

Delete row from dataframe having "None" value in all the columns - Python

I need to delete the row completely in a dataframe having "None" value in all the columns. I am using the following code -
df.dropna(axis=0,how='all',thresh=None,subset=None,inplace=True)
This does not bring any difference to the dataframe. The rows with "None" value are still there.
How to achieve this?
There Nones should be strings, so use replace first:
df = df.replace('None', np.nan).dropna(how='all')
df = pd.DataFrame({
'a':['None','a', 'None'],
'b':['None','g', 'None'],
'c':['None','v', 'b'],
})
print (df)
a b c
0 None None None
1 a g v
2 None None b
df1 = df.replace('None', np.nan).dropna(how='all')
print (df1)
a b c
1 a g v
2 NaN NaN b
Or test values None with not equal and DataFrame.any:
df1 = df[df.ne('None').any(axis=1)]
print (df1)
a b c
1 a g v
2 None None b
You should be dropping in the axis 1. Use the how keyword to drop columns with any or all NaN values. Check the docs
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,3], 'b':[-1, 0, np.nan], 'c':[np.nan, np.nan, np.nan]})
df
a b c
0 1 -1.0 NaN
1 2 0.0 NaN
2 3 NaN 5.0
df.dropna(axis=1, how='any')
a
0 1
1 2
2 3
df.dropna(axis=1, how='all')
a b
0 1 -1.0
1 2 0.0
2 3 NaN

Pandas Use Value if Not Null, Else Use Value From Next Column

Given the following dataframe:
import pandas as pd
df = pd.DataFrame({'COL1': ['A', np.nan,'A'],
'COL2' : [np.nan,'A','A']})
df
COL1 COL2
0 A NaN
1 NaN A
2 A A
I would like to create a column ('COL3') that uses the value from COL1 per row unless that value is null (or NaN). If the value is null (or NaN), I'd like for it to use the value from COL2.
The desired result is:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
Thanks in advance!
In [8]: df
Out[8]:
COL1 COL2
0 A NaN
1 NaN B
2 A B
In [9]: df["COL3"] = df["COL1"].fillna(df["COL2"])
In [10]: df
Out[10]:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
You can use np.where to conditionally set column values.
df = df.assign(COL3=np.where(df.COL1.isnull(), df.COL2, df.COL1))
>>> df
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
If you don't mind mutating the values in COL2, you can update them directly to get your desired result.
df = pd.DataFrame({'COL1': ['A', np.nan,'A'],
'COL2' : [np.nan,'B','B']})
>>> df
COL1 COL2
0 A NaN
1 NaN B
2 A B
df.COL2.update(df.COL1)
>>> df
COL1 COL2
0 A A
1 NaN B
2 A A
Using .combine_first, which gives precedence to non-null values in the Series or DataFrame calling it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'COL1': ['A', np.nan,'A'],
'COL2' : [np.nan,'B','B']})
df['COL3'] = df.COL1.combine_first(df.COL2)
Output:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
If we mod your df slightly then you will see that this works and in fact will work for any number of columns so long as there is a single valid value:
In [5]:
df = pd.DataFrame({'COL1': ['B', np.nan,'B'],
'COL2' : [np.nan,'A','A']})
df
Out[5]:
COL1 COL2
0 B NaN
1 NaN A
2 B A
In [6]:
df.apply(lambda x: x[x.first_valid_index()], axis=1)
Out[6]:
0 B
1 A
2 B
dtype: object
first_valid_index will return the index value (in this case column) that contains the first non-NaN value:
In [7]:
df.apply(lambda x: x.first_valid_index(), axis=1)
Out[7]:
0 COL1
1 COL2
2 COL1
dtype: object
So we can use this to index into the series
You can also use mask which replaces the values where COL1 is NaN by column COL2:
In [8]: df.assign(COL3=df['COL1'].mask(df['COL1'].isna(), df['COL2']))
Out[8]:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A

Resources