How to perform cumulative sum inside iterrows - python-3.x

I have a pandas dataframe as below:
df2 = pd.DataFrame({ 'b' : [1, 1, 1]})
df2
b
0 1
1 1
2 1
I want to create a column 'cumsum' with the cumulative sum of column b starting row 2. Also I want to use iterrows to perform this. I tried below code but it doesnot seem to work.
for row_index, row in df2.iloc[1:].iterrows():
df2.loc[row_index, 'cumsum'] = df2.loc[row_index, 'b'].cumsum()
My expected output:
b cum_sum
0 1 NaN
1 1 2
2 1 3

As your requirement, you may try this
for row_index, row in df2.iloc[1:].iterrows():
df2.loc[row_index, 'cumsum'] = df2.loc[:row_index, 'b'].sum()
Out[10]:
b cumsum
0 1 NaN
1 1 2.0
2 1 3.0

To stick to iterrows():
i=0
df2['cumsum']=0
col=list(df2.columns).index('cumsum')
for row_index, row in df2.iloc[1:].iterrows():
df2.loc[row_index, 'cumsum'] = df2.loc[row_index, 'b']+df2.iloc[i, col]
i+=1
Outputs:
b cumsum
0 1 0
1 1 1
2 1 2

Related

iterating over a list of columns in pandas dataframe

I have a dataframe like below. I want to update the value of column C,D, E based on column A and B.
If column A < B, then C, D, E = A, else B. I tried the below code but I'm getting ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). error
import pandas as pd
import math
import sys
import re
data=[[0,1,0,0, 0],
[1,2,0,0,0],
[2,0,0,0,0],
[2,4,0,0,0],
[1,8,0,0,0],
[3,2, 0,0,0]]
df
Out[59]:
A B C D E
0 0 1 0 0 0
1 1 2 0 0 0
2 2 0 0 0 0
3 2 4 0 0 0
4 1 8 0 0 0
5 3 2 0 0 0
df = pd.DataFrame(data,columns=['A','B','C', 'D','E'])
list_1 = ['C', 'D', 'E']
for i in df[list_1]:
if df['A'] < df['B']:
df[i] = df['A']
else:
df['i'] = df['B']
I'm expecting below output:
df
Out[59]:
A B C D E
0 0 1 0 0 0
1 1 2 1 1 1
2 2 0 0 0 0
3 2 4 2 2 2
4 1 8 1 1 1
5 3 2 2 2 2
np.where
Return elements are chosen from A or B depending on condition.
df.assign
Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.
nums = np.where(df.A < df.B, df.A, df.B)
df = df.assign(C=nums, D=nums, E=nums)
Use DataFrame.mask:
df.loc[:,df.columns != 'B']=df.loc[:,df.columns != 'B'].mask(df['B']>df['A'],df['A'],axis=0)
print(df)
A B C D E
0 0 1 0 0 0
1 1 2 1 1 1
2 2 0 0 0 0
3 2 4 2 2 2
4 1 8 1 1 1
5 3 2 0 0 0
personally i always use .apply to modify columns based on other columns
list_1 = ['C', 'D', 'E']
for i in list_1:
df[i]=df.apply(lambda x: x.a if x.a<x.b else x.b, axis=1)
I don't know what you are trying to achieve here. Because condition df['A'] < df['B'] will always return same output in your loop. Just for sake of understanding:
When you do if df['A'] < df['B']:
The if condition expects a Boolean, but df['A'] < df['B'] gives a Series of Boolean values. So, it says either use something like
if (df['A'] < df['B']).all():
OR
if (df['A'] < df['B']).any():
What I would do is I would only create a DataFrame with columns 'A' and 'B', and then create column 'C' in the following way:
df['C'] = df.min(axis=1)
Columns 'D' and 'E' seem to be redundant.
If you have to start with all the columns and need to have all of them as output then you can do the following:
df['C'] = df[['A', 'B']].min(axis=1)
df['D'] = df['C']
df['E'] = df['C']
You can use the function where in numpy:
df.loc[:,'C':'E'] = np.where(df['A'] < df['B'], df['A'], df['B']).reshape(-1, 1)

Pandas aggregate column and keep header

I have code which works but gives me data without header is there a way I can write this code so header is not removed? I know one way will be to add back header, but is there a better way?
My code:
df = pd.read_csv(“_data.csv",skiprows=[0], header=None)
df = df.groupby([2])[10].sum().astype(float)
Data:
A B
1 2
1 1
2 3
2 4
I have data like above trying to get this result:
A B
1 3
2 7
Try to use the function reset_index after the sum:
data = [{'a': 1, 'b': 2},{'a': 1, 'b': 1},{'a': 2, 'b': 3},{'a': 2, 'b': 4}]
df = pd.DataFrame(data)
df
a b
0 1 2
1 1 1
2 2 3
3 2 4
df.groupby('a').sum().reset_index()
a b
0 1 3
1 2 7
You should specify the separator (several spaces in your case) and that the header is the first row (=0, with python indexing), than groupby the column you want.
df = pd.read_csv("_data.csv", sep='\s*', header=0)
A B
0 1 2
1 1 1
2 2 3
3 2 4
df = df.groupby(['A']).sum()
B
A
1 3
2 7

Index order of a shuffle dataframe

I have two DataFrame, namely A and B. Bis generated by shuffling rows of A. I would like to know each row of B, what's the index of the same row in A.
Example:
A=pd.DataFrame({"a":[1,2,3],"b":[1,2,3],"c":[1,2,3]})
B=pd.DataFrame({"a":[2,3,1],"b":[2,3,1],"c":[2,3,1]})
A
a b c
0 1 1 1
1 2 2 2
2 3 3 3
B
a b c
0 2 2 2
1 3 3 3
2 1 1 1
The answer should be [1,2,0], because B equals A.loc[[1,2,0]]. I am wondering how to do this efficiently since my A and B is large.
I came up with probable solution using Dataframe.merge
A=pd.DataFrame({"a":[1,2,3],"b":[1,2,3],"c":[1,2,3]})
B=pd.DataFrame({"a":[2,3,1],"b":[2,3,1],"c":[2,3,1]})
A['index_a'] = A.index
B['index_b'] = B.index
merge_df= pd.merge(A, B, left_on=['a', 'b', 'c'], right_on=['a', 'b', 'c'])
Where merge_df is
a b c index_a index_b
0 1 1 1 0 2
1 2 2 2 1 0
2 3 3 3 2 1
Now you can reference the rows from A or B Dataframe
Example
You know that row with index 0 at A is at index 2 in B
NOTE Rows that do not match on neither dataframe will not be shown in merge_df
IIUC use merge
pd.merge(B.reset_index(), A.reset_index(),
left_on = A.columns.tolist(),
right_on = B.columns.tolist()).iloc[:,-1].values
array([1, 2, 0], dtype=int64)

How do I make a panda frames values across multiple columns, its columns

I have the following dataframe loaded up in Pandas.
print(pandaDf)
id col1 col2 col3
12a a b d
22b d a b
33c c a b
I am trying to convert the values across multiple rows into its columns so the output would be like this :
Desired output:
id a b c d
12a 1 1 0 1
22b 1 1 0 0
33c 1 1 1 0
I have tried adding in a value column where the value = 1 and using a pivot table
pandaDf['value'] = 1
column = ['col1', 'col2', 'col3']
pandaDf.pivot_table(index = 'id', value = 'value', columns = column)
However, the resulting data frame is a multilevel index and the pandaDf.pivot() method does not allow multiple column values.
Please advise about how I could do this with an output of a single level index.
Thanks for taking the time to read this and I apologize if I have made any formatting errors in posting the question. I am still learning the proper stackoverflow syntax.
You can use One-Hot Encoding to solve this problem.
Here is one way to do this pd.get_dummies and some multiindex flatten and sum:
df1 = df.set_index('id')
df_out = pd.get_dummies(df1)
df_out.columns = df_out.columns.str.split('_', expand=True)
df_out = df_out.sum(level=1, axis=1).reset_index()
print(df_out)
Output:
id a c d b
0 12a 1 0 1 1
1 22b 1 0 1 1
2 33c 1 1 0 1
Using get_dummies
pd.get_dummies(df.set_index('id'),prefix='', prefix_sep='').sum(level=0,axis=1)
Out[81]:
a c d b
id
12a 1 0 1 1
22b 1 0 1 1
33c 1 1 0 1

Replace values in specified list of columns based on a condition

The actual use case is that I want to replace all of the values in some named columns with zero whenever they are less than zero, but leave other columns alone. Let's say in the dataframe below, I want to floor all of the values in column a and b to zero, but leave column d alone.
df = pd.DataFrame({'a': [0, -1, 2], 'b': [-3, 2, 1],
'c': ['foo', 'goo', 'bar'], 'd' : [1,-2,1]})
df
a b c d
0 0 -3 foo 1
1 -1 2 goo -2
2 2 1 bar 1
The second paragraph in the accepted answer to this question: How to replace negative numbers in Pandas Data Frame by zero does provide a workaround, I can just set the datatype of column d to be non-numeric, and then change it back again afterwards:
df['d'] = df['d'].astype(object)
num = df._get_numeric_data()
num[num <0] = 0
df['d'] = df['d'].astype('int64')
df
a b c d
0 0 0 foo 1
1 0 2 goo -2
2 2 1 bar 1
but this seems really messy, and it means I need to know the list of the columns I don't want to change, rather than the list I do want to change.
Is there a way to just specify the column names directly
You can use mask and column filtering:
df[['a','b']] = df[['a','b']].mask(df<0, 0)
df
Output
a b c d
0 0 0 foo 1
1 0 2 goo -2
2 2 1 bar 1
Using np.where
cols_to_change = ['a', 'b', 'd']
df.loc[:, cols_to_change] = np.where(df[cols_to_change]<0, 0, df[cols_to_change])
a b c d
0 0 0 foo 1
1 0 2 goo 0
2 2 1 bar 1

Resources