Replace values in specified list of columns based on a condition - python-3.x

The actual use case is that I want to replace all of the values in some named columns with zero whenever they are less than zero, but leave other columns alone. Say that in the dataframe below, I want to floor all of the values in columns a and b at zero, but leave column d alone.
df = pd.DataFrame({'a': [0, -1, 2], 'b': [-3, 2, 1],
                   'c': ['foo', 'goo', 'bar'], 'd': [1, -2, 1]})
df
a b c d
0 0 -3 foo 1
1 -1 2 goo -2
2 2 1 bar 1
The accepted answer to this question: How to replace negative numbers in Pandas Data Frame by zero does provide a workaround: I can set the datatype of column d to be non-numeric, zero the negatives, and then change it back again afterwards:
df['d'] = df['d'].astype(object)
num = df._get_numeric_data()
num[num < 0] = 0
df['d'] = df['d'].astype('int64')
df
a b c d
0 0 0 foo 1
1 0 2 goo -2
2 2 1 bar 1
but this seems really messy, and it means I need to know the list of the columns I don't want to change, rather than the list I do want to change.
Is there a way to just specify the column names directly?

You can use mask and column filtering:
df[['a','b']] = df[['a','b']].mask(df[['a','b']] < 0, 0)
df
Output
a b c d
0 0 0 foo 1
1 0 2 goo -2
2 2 1 bar 1
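If the condition is simply "floor at zero", DataFrame.clip is an equivalent one-liner; a minimal sketch assuming the same df as above:
df[['a', 'b']] = df[['a', 'b']].clip(lower=0)  # floor the selected columns at zero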

Using np.where
cols_to_change = ['a', 'b', 'd']
df.loc[:, cols_to_change] = np.where(df[cols_to_change]<0, 0, df[cols_to_change])
a b c d
0 0 0 foo 1
1 0 2 goo 0
2 2 1 bar 1

Related

Get all columns per id where the column is equal to a value

Say I have a pandas dataframe:
id A B C D
id1 0 1 0 1
id2 1 0 0 1
id3 0 0 0 1
id4 0 0 0 0
I want to select, per id, all the column names where the value is equal to 1; this list will then be a new column in the dataframe.
Expected output:
id A B C D Result
id1 0 1 0 1 [B,D]
id2 1 0 0 1 [A,D]
id3 0 0 0 1 [D]
id4 0 0 0 0 []
I tried df.apply(lambda row: row[row == 1].index, axis=1) but the output of 'Result' was not in the form specified above.
You can do what you are trying to do by adding .tolist():
df['Result'] = df.apply(lambda row: row[row == 1].index.tolist(), axis=1)
That said, your approach of using lists as values inside a single column runs against the Pandas approach of keeping data tabular (only one value per cell). It will probably be better to use plain nested lists instead of pandas for what you are trying to do.
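If you prefer to avoid apply, a list comprehension over the boolean mask gives the same lists; a minimal sketch assuming the same df (a non-numeric id column simply never matches the == 1 test):
mask = df.eq(1).to_numpy()                              # boolean matrix marking the 1s
df['Result'] = [df.columns[m].tolist() for m in mask]   # matching column names per row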
Setup
I used a different set of ones and zeros to highlight skipping an entire row.
df = pd.DataFrame(
    [[0, 1, 0, 1], [1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 1, 0]],
    ['id1', 'id2', 'id3', 'id4'],
    ['A', 'B', 'C', 'D']
)
df
A B C D
id1 0 1 0 1
id2 1 0 0 1
id3 0 0 0 0
id4 0 0 1 0
Not Your Grampa's Reverse Binarizer
n = len(df)
i, j = np.nonzero(df.to_numpy())
col_names = df.columns[j]
positions = np.bincount(i, minlength=n).cumsum()[:-1]
result = np.split(col_names, positions)
df.assign(Result=[a.tolist() for a in result])
A B C D Result
id
id1 0 1 0 1 [B, D]
id2 1 0 0 1 [A, D]
id3 0 0 0 0 []
id4 0 0 1 0 [C]
Explanations
Ohh, the details!
np.nonzero on a 2-D array will return two arrays of equal length. The first array will have the 1st dimensional position of each element that is not zero. The second array will have the 2nd dimensional position of each element that is not zero. I'll call the first array i and the second array j.
In the figure below, I label the columns with what j they represent and correspondingly, I label the rows with what i they represent.
For each non-zero element of the dataframe, I place above the value a tuple with its (ith, jth) position, and in brackets its position [k] among the non-zero elements of the dataframe.
# j → 0 1 2 3
# A B C D
# i
# ↓ (0,1)[0] (0,3)[1]
# 0 id1 0 1 0 1
#
# (1,0)[2] (1,3)[3]
# 1 id2 1 0 0 1
#
#
# 2 id3 0 0 0 0
#
# (3,2)[4]
# 3 id4 0 0 1 0
In the figure below, I show what i and j look like. I label each row with the same k as in the brackets of the figure above.
# i j
# ----
# 0 1 k=0
# 0 3 k=1
# 1 0 k=2
# 1 3 k=3
# 3 2 k=4
Now it's easy to slice df.columns with the j array to get all the column labels in one big array.
# df.columns[j]
# B
# D
# A
# D
# C
The plan is to use np.split to chop up df.columns[j] into the sub-arrays for each row. It turns out that the information we need is embedded in the array i. I'll use np.bincount to count how many non-zero elements are in each row. I'll need to tell np.bincount the minimum number of bins we are assuming to have. That minimum is the number of rows in the dataframe. We assign it to n with n = len(df).
# np.bincount(i, minlength=n)
# 2 ← Two non-zero elements in the first row
# 2 ← Two more in the second row
# 0 ← None in this row
# 1 ← And one more in the fourth row
If we then take the cumulative sum of this array, we get the positions we need to split at.
# np.bincount(i, minlength=n).cumsum()
# 2
# 4
# 4 ← This repeated value results in an empty array for the 3rd row
# 5
Let's look at how this matches up with df.columns[j]. We see below that the column slice gets split exactly where we need it.
# B D A D C ← df.columns[j]
# 2 4 4 5 ← np.bincount(i, minlength=n).cumsum() gives the split points
One issue is that the 4 values in this array will split the df.columns[j] array into 5 sub-arrays. This isn't horrible, because the last array will always be empty, so we slice the positions to the appropriate size with np.bincount(i, minlength=n).cumsum()[:-1]
col_names = df.columns[j]
positions = np.bincount(i, minlength=n).cumsum()[:-1]
result = np.split(col_names, positions)
# result
# [B, D]
# [A, D]
# []
# [C]
The only thing left to do is assign it to a new column and turn the individual sub-arrays into lists.
df.assign(Result=[a.tolist() for a in result])
# A B C D Result
# id
# id1 0 1 0 1 [B, D]
# id2 1 0 0 1 [A, D]
# id3 0 0 0 0 []
# id4 0 0 1 0 [C]
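Another vectorized route goes through stack and a groupby; a sketch under the same Setup (the reindex restores rows such as id3 that drop out because they contain no 1s):
ones = df.stack()                   # (row, column) -> value
ones = ones[ones == 1]              # keep only the 1s
cols = pd.Series(ones.index.get_level_values(1), index=ones.index.get_level_values(0))
lists = cols.groupby(level=0).agg(list)   # column names per row label
df.assign(Result=lists.reindex(df.index).apply(lambda v: v if isinstance(v, list) else []))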

How to perform cumulative sum inside iterrows

I have a pandas dataframe as below:
df2 = pd.DataFrame({ 'b' : [1, 1, 1]})
df2
b
0 1
1 1
2 1
I want to create a column 'cumsum' with the cumulative sum of column b, starting at the second row. I also want to use iterrows to perform this. I tried the code below but it doesn't seem to work.
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[row_index, 'b'].cumsum()
My expected output:
b cum_sum
0 1 NaN
1 1 2
2 1 3
As per your requirement, you may try this:
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[:row_index, 'b'].sum()
Out[10]:
b cumsum
0 1 NaN
1 1 2.0
2 1 3.0
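Note that this re-sums the whole slice df2.loc[:row_index, 'b'] on every iteration, so the loop does quadratic work overall; that is fine for a frame this small but will not scale well.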
To stick to iterrows():
df2['cumsum'] = df2['b']  # seed the running total with b itself
col = list(df2.columns).index('cumsum')
i = 0
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[row_index, 'b'] + df2.iloc[i, col]
    i += 1
Outputs (row 0 keeps its own b value rather than NaN; overwrite it afterwards if you need the NaN):
b cumsum
0 1 1
1 1 2
2 1 3
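For reference, the loop is not actually needed here: Series.cumsum produces the same running total in one vectorized step. A minimal sketch, blanking out the first row to match the expected output (which makes the column float):
import numpy as np
df2['cumsum'] = df2['b'].cumsum()   # 1, 2, 3
df2.loc[0, 'cumsum'] = np.nan       # NaN, 2.0, 3.0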

Is there any way to make more than one dummy variable at a time? [duplicate]

How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?
With pandas 0.19, you can do that in a single line:
pd.get_dummies(data=df, columns=['A', 'B'])
The columns argument specifies where to do the one-hot encoding.
>>> df
A B C
0 a c 1
1 b c 2
2 a b 3
>>> pd.get_dummies(data=df, columns=['A', 'B'])
C A_a A_b B_b B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
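As a side note, recent pandas versions return boolean dummy columns by default; if you want the 0/1 integers shown above, get_dummies accepts a dtype argument (available since pandas 0.23):
pd.get_dummies(data=df, columns=['A', 'B'], dtype=int)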
Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series; see below for the workaround):
In [1]: df = DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
...: 'C': [1, 2, 3]})
In [2]: df
Out[2]:
A B C
0 a c 1
1 b c 2
2 a b 3
In [3]: pd.get_dummies(df)
Out[3]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
Workaround for pandas < 0.15.0
You can do it for each column separately and then concat the results:
In [111]: df
Out[111]:
A B
0 a x
1 a y
2 b z
3 b x
4 c x
5 a y
6 b y
7 c z
In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]:
A B
a b c x y z
0 1 0 0 1 0 0
1 1 0 0 0 1 0
2 0 1 0 0 0 1
3 0 1 0 1 0 0
4 0 0 1 1 0 0
5 1 0 0 0 1 0
6 0 1 0 0 1 0
7 0 0 1 0 0 1
If you don't want the multi-index column, then remove the keys=.. from the concat function call.
Somebody may have something more clever, but here are two approaches, assuming you have a dataframe named df with columns 'Name' and 'Year' that you want dummies for.
First, simply iterating over the columns isn't too bad:
In [93]: for column in ['Name', 'Year']:
    ...:     dummies = pd.get_dummies(df[column])
    ...:     df[dummies.columns] = dummies
Another idea would be to use the patsy package, which is designed to construct data matrices from R-type formulas.
In [94]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")
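Keep in mind that patsy has to be imported first, and that dmatrix treatment-codes the factors: it adds an Intercept column and drops one level of each factor, so the output is related to but not identical to get_dummies.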
Unless I don't understand the question, it is supported natively in get_dummies by passing the columns argument.
The simple trick I am currently using is a for loop. First separate the categorical data from the DataFrame using select_dtypes(include="object"), then apply get_dummies to each column iteratively in a loop, as shown in the code below:
train_cate = train_data.select_dtypes(include="object")
test_cate = test_data.select_dtypes(include="object")
# vectorize categorical data
for col in train_cate:
    cate1 = pd.get_dummies(train_cate[col])
    train_cate[cate1.columns] = cate1
    cate2 = pd.get_dummies(test_cate[col])
    test_cate[cate2.columns] = cate2
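One caveat with dummying train and test independently like this: if a category appears in only one of the two frames, their dummy columns will no longer line up. A sketch of a common fix, fitting a single scikit-learn OneHotEncoder on the training data (sparse_output needs scikit-learn >= 1.2; older versions use sparse=False):
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
train_ohe = enc.fit_transform(train_cate)   # learn the categories from train only
test_ohe = enc.transform(test_cate)         # same columns; unseen categories encode as all zeros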

iterating over a list of columns in pandas dataframe

I have a dataframe like the one below. I want to update the values of columns C, D, and E based on columns A and B.
If column A < B, then C, D, E = A, else B. I tried the code below but I'm getting the error ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
import pandas as pd
import math
import sys
import re
data = [[0, 1, 0, 0, 0],
        [1, 2, 0, 0, 0],
        [2, 0, 0, 0, 0],
        [2, 4, 0, 0, 0],
        [1, 8, 0, 0, 0],
        [3, 2, 0, 0, 0]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])
df
Out[59]:
A B C D E
0 0 1 0 0 0
1 1 2 0 0 0
2 2 0 0 0 0
3 2 4 0 0 0
4 1 8 0 0 0
5 3 2 0 0 0
list_1 = ['C', 'D', 'E']
for i in df[list_1]:
    if df['A'] < df['B']:
        df[i] = df['A']
    else:
        df['i'] = df['B']
I'm expecting below output:
df
Out[59]:
A B C D E
0 0 1 0 0 0
1 1 2 1 1 1
2 2 0 0 0 0
3 2 4 2 2 2
4 1 8 1 1 1
5 3 2 2 2 2
np.where
Returns elements chosen from A or B depending on condition.
df.assign
Assigns new columns to a DataFrame.
Returns a new object with all original columns in addition to the new ones. Existing columns that are re-assigned will be overwritten.
nums = np.where(df.A < df.B, df.A, df.B)
df = df.assign(C=nums, D=nums, E=nums)
Use mask: take B, replace it with A wherever A < B, and broadcast the result into C, D and E:
filled = df['B'].mask(df['A'] < df['B'], df['A'])
for col in ['C', 'D', 'E']:
    df[col] = filled
print(df)
A B C D E
0 0 1 0 0 0
1 1 2 1 1 1
2 2 0 0 0 0
3 2 4 2 2 2
4 1 8 1 1 1
5 3 2 2 2 2
Personally, I always use .apply to modify columns based on other columns:
list_1 = ['C', 'D', 'E']
for i in list_1:
    df[i] = df.apply(lambda x: x.A if x.A < x.B else x.B, axis=1)
I don't know what you are trying to achieve here, because the condition df['A'] < df['B'] will always return the same output in your loop. Just for the sake of understanding:
When you write if df['A'] < df['B']:, the if expects a single Boolean, but df['A'] < df['B'] gives a Series of Boolean values. So pandas suggests using something like
if (df['A'] < df['B']).all():
OR
if (df['A'] < df['B']).any():
What I would do is create a DataFrame with only columns 'A' and 'B', and then create column 'C' in the following way:
df['C'] = df.min(axis=1)
Columns 'D' and 'E' seem to be redundant.
If you have to start with all the columns and need to have all of them as output then you can do the following:
df['C'] = df[['A', 'B']].min(axis=1)
df['D'] = df['C']
df['E'] = df['C']
You can use the function where in numpy:
df.loc[:,'C':'E'] = np.where(df['A'] < df['B'], df['A'], df['B']).reshape(-1, 1)
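Since the condition just picks the smaller of A and B, np.minimum expresses the same thing without an explicit condition; a minimal sketch using the df above:
smallest = np.minimum(df['A'], df['B'])   # element-wise minimum of the two columns
df = df.assign(C=smallest, D=smallest, E=smallest)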

Index order of a shuffled dataframe

I have two DataFrames, namely A and B. B is generated by shuffling the rows of A. For each row of B, I would like to know the index of the same row in A.
Example:
A=pd.DataFrame({"a":[1,2,3],"b":[1,2,3],"c":[1,2,3]})
B=pd.DataFrame({"a":[2,3,1],"b":[2,3,1],"c":[2,3,1]})
A
a b c
0 1 1 1
1 2 2 2
2 3 3 3
B
a b c
0 2 2 2
1 3 3 3
2 1 1 1
The answer should be [1,2,0], because B equals A.loc[[1,2,0]]. I am wondering how to do this efficiently, since my A and B are large.
I came up with a possible solution using DataFrame.merge:
A=pd.DataFrame({"a":[1,2,3],"b":[1,2,3],"c":[1,2,3]})
B=pd.DataFrame({"a":[2,3,1],"b":[2,3,1],"c":[2,3,1]})
A['index_a'] = A.index
B['index_b'] = B.index
merge_df = pd.merge(A, B, left_on=['a', 'b', 'c'], right_on=['a', 'b', 'c'])
Where merge_df is
a b c index_a index_b
0 1 1 1 0 2
1 2 2 2 1 0
2 3 3 3 2 1
Now you can cross-reference rows between the two DataFrames.
Example
You know that the row with index 0 in A is at index 2 in B.
Note: rows that do not appear in both dataframes will not be shown in merge_df.
IIUC use merge
pd.merge(B.reset_index(), A.reset_index(),
left_on = A.columns.tolist(),
right_on = B.columns.tolist()).iloc[:,-1].values
array([1, 2, 0], dtype=int64)
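Another option, assuming the rows of A are unique and using the original frames (without the helper index columns added above), is to build a lookup from each row tuple of A to its index and reindex it with B's rows (pd.MultiIndex.from_frame needs pandas >= 0.24):
lookup = pd.Series(A.index, index=pd.MultiIndex.from_frame(A))   # row values -> original index
order = lookup.reindex(pd.MultiIndex.from_frame(B)).to_numpy()
# order -> array([1, 2, 0])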
